====== Lab 04 - Memory Monitoring ======

- | |||
- | Why is **Networking** **Important**? | ||
- | |||
- | Having a well-established network has become an important part of our lives. The easiest way to expand your network is to build on the relationships with people you know; family, friends, classmates and colleagues. We are all expanding our networks daily. | ||
===== Objectives =====
  * Offer an introduction to Virtual Memory.
  * Get you acquainted with relevant commands and their outputs for monitoring memory-related aspects.
  * Introduce the concept of page de-duplication.
  * Present a step-by-step guide to Intel PIN for dynamic instrumentation.

===== Contents =====
{{page>:ep:labs:04:meta:nav&nofooter&noeditbutton}}

===== Proof of Work =====
Before you start, create a [[http://docs.google.com/|Google Doc]]. Here, you will add screenshots / code snippets / comments for each exercise. Whatever you decide to include, it must prove that you managed to solve the given task (so don't show just the output, but how you obtained it and what conclusion can be drawn from it). If you decide to complete the feedback for bonus points, include a screenshot with the form submission confirmation, but not with its contents.

When done, export the document as a //pdf// and upload it in the appropriate assignment on [[https://curs.upb.ro/2024/course/view.php?id=9907|moodle]]. The deadline is 23:55 on Friday.

===== Introduction =====
<spoiler>
When talking about memory, one can be referring either to the CPU's cache or to main memory (i.e., RAM). Since the former has been discussed (hopefully exhaustively) during other courses such as [[https://ocw.cs.pub.ro/courses/asc|ASC]], today we'll be focusing on the latter. If you feel that there's still more for you to learn about the CPU cache, check out this very well-known [[https://people.freebsd.org/~lstewart/articles/cpumemory.pdf|article]]. With that out of the way, here are a few things to keep in mind moving forward:

** Virtual Memory **

Reminding you of this concept may be redundant at this point, but here goes. The programs that you are writing do **not** have direct access to physical memory. All addresses that you access from user space are translated to physical addresses by the Memory Management Unit (MMU) of the CPU. The MMU stores as many virtual -- physical address pairs as it can in its Translation Lookaside Buffer (TLB). When the TLB fills up, the least recently accessed entries are flushed to make room for new ones. When a new virtual address is encountered, the CPU looks up its physical counterpart in a structure managed by the kernel. This structure is in fact a 4-level tree where each node is a list of 512 entries pointing to the next level. The leaf nodes yield the physical page address. Some of you might have already noticed something strange: an offset in the range [0; 511] can be represented using only 9 bits. Having a 4-level page table means that the offsets fit into 36 bits of the 64-bit virtual address. If we add the size of a page offset (12 bits), we're still 16 bits short. Good catch! Modern x64 CPUs, while technically using 64-bit addresses, don't support 2^64 bytes of addressable virtual memory. That being said, almost nobody has ever complained about this, since 2^48 bytes is still more than anyone needs.

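
If you want to see what this virtual address space looks like for a running process, ''/proc'' already exposes it. A minimal sketch, using the current shell as the target (the addresses and mapping names will differ on your machine):

<code bash>
# One line per mapping: <start>-<end> <perms> <offset> <dev> <inode> <backing file>
cat /proc/$$/maps

# pmap prints a friendlier summary, including the resident and dirty size per mapping
pmap -x $$
</code>
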
So what are the reasons for implementing virtual memory? Simple: security, performance and convenience. Let's tackle these one by one:

**Security:**
User space processes should not have direct access to the physical address space. If they did, they could inspect and change the memory of other processes, and possibly even the kernel's. Moreover, not every physical address that can be accessed (from the kernel's perspective) refers to RAM. Some devices have memory-mapped registers that can be used by reading from / writing to them. E.g., a serial device driver can put a char on the wire by [[https://elixir.bootlin.com/linux/latest/source/drivers/tty/serial/imx_earlycon.c#L24|writing]] it to a certain 32-bit aligned address. Similarly, it can check whether the serial device is currently busy writing the previous character by [[https://elixir.bootlin.com/linux/latest/source/drivers/tty/serial/imx_earlycon.c#L21|reading]] a register constantly updated by said device with its status. Normally, you'd abstract the hardware away from user space programs by having drivers interpret requests presented by the process via system calls. By using virtual memory, even if the process has knowledge of the underlying hardware, it won't be able to access those device registers.

//"But I really want to access those registers..."// you may be thinking. No worries, then: [[https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html|Userspace I/O]] (UIO) is a kernel module that allows mapping device registers into your user space process, thus enabling you to implement drivers without actually knowing anything about how kernel modules work :p. If that's not convenient enough for you, there's also ''/dev/mem''. This device can essentially be opened as a regular file (i.e.: with ''open()'') and allows you to read / write **physical** memory, usually via the ''pread()'' and ''pwrite()'' syscalls, respectively. Needless to say, using either of these systems requires your process to have the **CAP_SYS_ADMIN** capability (if you don't know what that means, just run it with **sudo** :p). One example where mapping devices into the virtual address space of a process proves useful is the [[https://www.dpdk.org/about/|Intel DataPlane Development Kit]] (DPDK), a user-space implementation of network drivers using UIO. DPDK is used on servers with a high traffic load to avoid performing too many context switches only to receive packets in user space. Note, however, that using UIO essentially makes the device inaccessible to kernel drivers. In the case of DPDK, the Network Interface Controller (NIC) becomes inaccessible system-wide, with the exception of the processes using it.

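
As a quick illustration (not required for the exercises), this is roughly what a ''/dev/mem'' read looks like from the shell. Keep in mind that most distribution kernels enable ''CONFIG_STRICT_DEVMEM'', which restricts the physical ranges that may be accessed, so the command below can fail or return nothing depending on your kernel:

<code bash>
# Dump 64 bytes of physical memory from 0xF0000 (the legacy BIOS ROM area on x86,
# one of the few ranges typically still readable with CONFIG_STRICT_DEVMEM).
# dd's skip= is counted in bs-sized blocks, so bs=1 turns it into a byte offset.
sudo dd if=/dev/mem bs=1 skip=$(( 0xF0000 )) count=64 2>/dev/null | xxd
</code>
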
In some //very// particular cases, you might want to know the physical addresses of your pages. On the surface, this might seem reasonable. After all, you //can// access them via virtual addressing, so why not? This //could// be done via [[https://www.kernel.org/doc/html/latest/admin-guide/mm/pagemap.html|/proc/<pid>/pagemap]], but recently it's been changed to also require **CAP_SYS_ADMIN**. The reason for this is that knowing the physical address of your memory pages can allow you to mount cache-based side channel attacks against other processes. This is not a trivial threat; cache side-channels are the most common class of hardware side-channels and among the only practical ones, even in a research context.

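
To see what such a lookup involves, here is a rough sketch that fetches the pagemap entry for the start of the current shell's heap. Per the documentation linked above, bit 63 of each 8-byte entry says whether the page is present in RAM and the low 55 bits hold its page frame number; the actual address and values will differ on your system, and ''sudo'' is needed because of the restriction mentioned above:

<code bash>
# Start address (hex) of the shell's [heap] mapping, converted to decimal
vaddr=$(( 16#$(grep '\[heap\]' /proc/$$/maps | cut -d- -f1) ))

# pagemap holds one 8-byte entry per virtual page, indexed by (virtual address / page size)
sudo dd if=/proc/$$/pagemap bs=8 skip=$(( vaddr / 4096 )) count=1 2>/dev/null | od -A n -t x8
</code>
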
**Performance:** You should already be fairly familiar with this: processes that use the same library don't in fact each have their own copy in RAM. Instead, virtual addresses to the read-only pages of a library usually point to the same physical addresses in RAM. The advantage here is that you don't have to load a dozen different libraries from persistent storage (i.e.: HDD, SSD, etc.) every time you start up a process. Let's say that you have 1000 processes, each using **libc.so**. Having ~1.8MB of read-only pages backed by **libc.so** copied over into RAM for each process would easily waste ~2GB of your RAM. And that's just one library... That being said, even mapping these libraries into the virtual address space (using ''mmap()'', usually taken care of by ''ld-linux.so'' for you) is a costly operation. Looking at the American Fuzzy Lop (AFL) fuzzer, we can find an interesting optimization called [[https://afl-1.readthedocs.io/en/latest/about_afl.html|Fork Server]] that sidesteps re-mapping all libraries for every newly spawned instance of the same target: it hooks the ''main()'' function and, instead of ''exec()''-ing thousands of times per second, it simply ''fork()''s the process so that the children start off with a copy of the original's address space. Fun stuff!

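
You can actually observe this sharing through ''/proc/<pid>/smaps'': the proportional set size (Pss) divides every shared page by the number of processes mapping it, so for a heavily shared library Pss ends up much smaller than Rss. A small sketch (the ''libc'' pattern may need tweaking on your distribution):

<code bash>
# Rss vs. Pss for the executable (r-xp) mapping of libc in the current shell.
# If many processes map the same libc pages, Pss will be only a fraction of Rss.
grep -A 20 'r-xp.*libc' /proc/$$/smaps | grep -E '^(Rss|Pss|Shared_Clean):'
</code>
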
**Convenience:** Many of the points made previously already show how convenient virtual memory is. One thing to add would be that even the kernel uses it. The reason for this is the ability to remap physical devices at different addresses. ARM takes this a step further with its [[https://developer.arm.com/documentation/102142/0100/Stage-2-translation|Two-Stage Address Translation]], allowing the Hypervisor (running at Exception Level 2) to fake the existence of certain devices or to more accurately emulate certain platforms. Note, however, that ARM communicates the layout of hardware components in the address space to the kernel via a Flattened Device Tree (FDT). E.g., [[https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mn.dtsi#L811|here]] the address and size of the **uart1** device are given by the **reg** property, containing a tuple representing the base address (0x30860000) and the memory size reserved for said device (0x10000 -- 16 pages, not all of them used in reality). On x86-64, FDTs are not used; other mechanisms (such as ACPI) are used to probe for available hardware.

** Out Of Memory Killer **

What happens when you start running out of RAM on your system? The default behavior is that the kernel chooses one or more processes to kill, thus freeing up some RAM. This is known as the Out Of Memory (OOM) Killer. In order to do this, each process is assigned an OOM score. A higher score indicates a higher chance of getting killed once the OOM Killer is woken up. The primary factor that influences this score is the amount of memory used. Modifiers that raise this value include the niceness value of the process and the number of ''fork()''s. On the other hand, being privileged, having run for a long time or performing hardware I/O reduces the likelihood of being killed. Then comes the user's preference: writing a value to ''/proc/<pid>/oom_score_adj'' (within certain limits -- decided at kernel compile time) will also tip the scales, one way or another. Writing the lowest permitted value will instead categorically prevent the process from being chosen. All this being said, is there an alternative to killing processes?

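
Both the score and the user-controlled adjustment are exposed under ''/proc'', so you can experiment with them directly. A minimal sketch using the current shell (lowering ''oom_score_adj'' requires root):

<code bash>
# Current OOM score of the shell (higher means a more likely victim)
cat /proc/$$/oom_score

# Make the shell a much less attractive victim for the OOM Killer
echo -500 | sudo tee /proc/$$/oom_score_adj

# Check whether the OOM Killer has fired recently on this machine
sudo dmesg | grep -iE 'out of memory|oom-killer' | tail
</code>
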
** Swap Space **

The system can reserve a portion of its persistent storage devices (i.e., HDD, SSD, etc.) for the express purpose of storing RAM pages when memory starts running low. For a long time, a dedicated partition was needed to serve as swap space. Nowadays, users can also create //swap files// on top of an existing file system and use them instead of (or alongside) a dedicated swap partition. This allows easily resizing the swap space without modifying partitions. When memory usage exceeds a certain threshold (high watermark), the kernel's [[https://docs.kernel.org/admin-guide/mm/damon/reclaim.html|Page Frame Reclamation]] system begins copying the least recently used pages to swap. This goes on until the amount of used memory drops below another threshold (low watermark). When a page is evicted to swap, the corresponding Page Table Entry (PTE) in the page table is modified to indicate its location in swap, instead of its (previous) physical address in RAM.

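
For reference, this is roughly how a swap file can be created and enabled on a running system (the size and path are arbitrary here; add an ''/etc/fstab'' entry if you want it to persist across reboots):

<code bash>
# Create a 1 GiB file, lock down its permissions, format it as swap and enable it
sudo fallocate -l 1G /swapfile   # use dd if=/dev/zero ... on filesystems without fallocate support
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify: list the active swap areas and the overall memory / swap usage
swapon --show
free -h
</code>
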
We note that Swap Space is an optional feature, but having it can improve system performance even if you don't have low memory issues. Nowadays, the kernel will try to evict pages from RAM proactively if they haven't been accessed for a prolonged period of time. Evicting them to swap is not the only option. If a file is mapped in memory (via ''mmap()''), then the kernel already has a known copy of it in your filesystem if it is ever needed again. So evicting **libc.so**'s pages to swap is unnecessary, since there's already a copy of it in ''/usr/lib/''. This form of proactive eviction is implemented for two main reasons: 1) to avoid reaching a point where **kswapd** (the kernel swap daemon) needs to aggressively evict pages, or where the **OOM Killer** needs to kill processes (in the absence of any swap), and 2) to maximize the amount of memory available for file caching without overcommitting CPU cycles to this task. The problem with not having any Swap Space is that you can only evict file-backed pages. Memory buffers (e.g.: the ''malloc()'' memory pool) are in fact created as anonymous ''mmap()''-ed pages. You would normally think that anonymous pages don't have a backing file, but internally the swap device is considered their backing store. Not having any swap device present on your system will automatically disqualify any anonymous mappings from being evicted. Since most anonymous pages belong to memory allocation pools that are largely underutilized, being forced to swap out (mostly) code pages from less utilized libraries instead can result in performance loss due to unnecessary I/O in the long run.

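
If you are curious how much of each process has actually been swapped out, ''/proc/<pid>/status'' keeps a per-process ''VmSwap'' counter. A quick sketch (all values will be 0 kB unless the system has been under memory pressure):

<code bash>
# Top 10 processes by amount of memory currently swapped out (in kB)
for f in /proc/[0-9]*/status; do
    awk '/^Name:/ { name = $2 } /^VmSwap:/ { print $2, name }' "$f"
done | sort -rn | head
</code>
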
</spoiler>

===== Tasks =====
The skeleton for this lab can be found in this [[https://github.com/cs-pub-ro/EP-labs|repository]]. Clone it locally before you start.

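
For example, from a terminal:

<code bash>
git clone https://github.com/cs-pub-ro/EP-labs.git
cd EP-labs
</code>
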
{{namespace>:ep:labs:04:contents:tasks&nofooter&noeditbutton}}