This shows you the differences between two versions of the page.
|
ep:teme:01 [2025/04/16 01:14] radu.mantu [3. Proof of work] |
ep:teme:01 [2026/03/04 14:35] (current) radu.mantu [Memory access tracing] |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ~~NOTOC~~ | ||
| - | |||
| ====== Assignment ====== | ====== Assignment ====== | ||
| - | ===== 1. Overview ===== | + | ===== 01. Overview ===== |
| - | <note> | + | The goal of this assignment is to implement a tool based on [[https://man.archlinux.org/man/perf_event_open.2.en|Linux Perf Events]] that is able to monitor main memory accesses performed by another process. |
| - | Code skeleton available at [[https://github.com/cs-pub-ro/EP-assignment-2025/]]. | + | |
| - | </note> | + | |
| - | ==== 1.1. Simulated network ==== | + | |
| - | In **topology.py** you have the following [[https://mininet.org/|Mininet]] topology. In our experiments, we will run **iperf3** servers on **h3** and clients on **h1** and **h2**. The goal of this assignment is for you to measure different TCP metrics for specific connections, plot the results and interpret the plot. | + | For this assignment you will be allowed to **work in pairs**. Also, you will need to have an **Intel CPU** capable of recording **MEM_INST_RETIRED** events. Anything newer than Nehalem should do. |
| - | {{ :ep:teme:assign-topology.png?700 |}} | + | ===== 02. Requirements ===== |
| - | ==== 1.2. Netlink socket diagnostics ==== | + | ==== Partner up ==== |
| - | Up until this point, you may have used **netstat**, but not its modern-day equivalent **ss**. The former gathers its information from ''/proc/net/tcp'' and other related virtual files. Needless to say, the available information is quite limited. For this reason, a special type of socket (i.e., [[https://www.man7.org/linux/man-pages/man7/netlink.7.html|Netlink socket]]) was created to communicate directly with the kernel. The [[https://www.man7.org/linux/man-pages/man7/sock_diag.7.html|socket diagnostics]] subsystem was built on top of Netlink in order to rapidly extract //extensive// information regarding local sockets and their connections. As you may have guessed, **ss** uses this subsystem. However, we want to interact with it directly. If we were to repeatedly invoke **ss** in order to get updated statistics regarding one such socket, we would incur needless overhead from repeatedly spawning these processes. This overhead would amount to ~2-3ms / execution, severely limiting our sampling frequency. | + | Select a partner for this assignment and submit your choice via [[https://forms.gle/unnN3f8pksSbg85g9|this form]]. \\ |
| + | If you can't find a partner, try advertising on the [[https://curs.upb.ro/2025/mod/forum/discuss.php?d=3902|assignment forum]]. | ||
| - | In **socket_diag.c** we have implemented a demo application that obtains the source and destination IPs and ports of all ESTABLISHED TCP connections, plus the inode of the associated socket. Yes, sockets have inodes too. Just check the ''/proc/<pid>/fd/'' of your browser process. Any symlink with a value such as ''socket:[122505]'' is a socket, and the numeric part is the inode. | + | <note important> |
| + | Only one student is required to complete the form on behalf of the team.\\ | ||
| + | Only one student (not necessarily the same) will have to upload the assignment on moodle.\\ | ||
| + | You are **required** to work with a partner on this assignment. | ||
| + | </note> | ||
| - | Anyway, try compiling **socket_diag.c** and execute it: | + | ==== Usage ==== |
| - | <code bash> | + | |
| - | $ gcc socket_diag.c -o socket_diag | + | |
| - | $ sudo ./socket_diag /proc/$$/ns/net | + | |
| - | ================================= | + | |
| - | sport : 49606 | + | |
| - | dport : 443 | + | |
| - | src ip : 192.168.100.16 | + | |
| - | dst ip : 3.67.245.95 | + | |
| - | inode : 24615 | + | |
| - | ================================= | + | |
| - | sport : 49596 | + | |
| - | dport : 443 | + | |
| - | src ip : 192.168.100.16 | + | |
| - | dst ip : 3.67.245.95 | + | |
| - | inode : 17878 | + | |
| - | ================================= | + | |
| - | ... | + | |
| - | </code> | + | |
| - | === Namespace compatibility === | + | Your application should be implemented in C/C++ and take as positional arguments the command-line invocation of the program under test. For example, ''./my_tracer curl http://example.com'' will launch the tracer program, which will then fork() & exec() **curl** and start monitoring its memory transactions at the same time. In case you need to add flags to your application, you can separate them from the command line of the child process with ''%%--%%''. |
| - | One of the challenges of network observability in Linux is dealing with [[https://www.man7.org/linux/man-pages/man7/namespaces.7.html|Network Namespaces]]. For instance, try to spin up a docker container and listen on a port using **netcat**. Can you identify that open port using **netstat** or **ss** from your //host system//, and not from your container? The answer is no. Your container operates in another namespace than the shell where you're running **netstat** and **ss**. The question is, what can you do to solve this problem? | + | ==== Memory access tracing ==== |
| - | Well, if you can identify a process that's running inside that container, you can **open()** its ''/proc/<pid>/ns/net'' symlink and use the [[https://www.man7.org/linux/man-pages/man2/setns.2.html|setns()]] syscall to transition your process within the same namespace. Any subsequent network-related operation (including queries to the Socket Diagnostics subsystem) will target the container's namespace. We have already implemented this functionality for you in **socket_diag.c**. That is why we needed to pass it an argument in the example above. | + | Once the child process is up and running, you will have to monitor the **read** and **write** operations //separately//. Specifically, you will have to determine **what address has been accessed** and **what instruction performed this access**. This can be achieved using [[https://www.intel.com/content/www/us/en/developer/articles/technical/timed-process-event-based-sampling-tpebs.html|Intel Processor Event Based Sampling (PEBS)]], a mode of operation that will write detailed sample information in a physical memory ring buffer whenever the event counter triggers. You will not be required to interact with this system directly, but instead utilize the [[https://man.archlinux.org/man/perf_event_open.2.en#MMAP_layout|sampled mode]] of Linux Perf Events. |
| - | ==== 1.3. bpftune ==== | + | ==== Mapping addresses to objects ==== |
| - | In our earlier network monitoring lab, we briefly discussed eBPF. [[https://github.com/oracle/bpftune|bpftune]] is a tool created by Oracle that leverages eBPF's capability to dynamically instrument the TCP/IP stack (similar to [[https://github.com/cilium/pwru|pwru]]) to perform auto-tuning depending on the network conditions. For example, it may adjust the socket buffer sizes whenever their use exceeds a certain threshold. | + | Once this task is complete, your next objective is to map both the accessed address and the instruction's address to a memory-mapped object (where appropriate). For instance, you will have to be able to distinguish between memory accesses performed by code belonging to **libc** and those performed by code belonging to **libz**. Additionally, you must identify whether the accessed memory address belongs to a data segment of a memory-mapped object, or the heap / stack instead. To solve this task, know that the Linux Perf system can generate more than just PMC event records while in sampled mode. In fact, the kernel can be configured to report any **mmap()** that the program under test performs. This is how **perf record** can embed object information into the sample file in order for **perf report** to subsequently translate those samples into //"hot"// functions, even with ASLR enabled. |
| - | One interesting feature is that it has support for //network namespaces//, meaning that it can apply these optimizations on a per-node basis in our Mininet simulation. Also, we only need to run one instance of it and it will automatically detect existing namespaces. Compile and install **bpftune**. You can run it with the **-s** flag to force it to output its changes to stdout. | + | <note> |
| + | It is possible for memory accesses to be performed by instructions located in non-file-backed regions. For example, JIT-ed JavaScript code generated by **V8** for Chromium and **SpiderMonkey** for Firefox, or **LuaJIT** for Neovim plugins or World of Warcraft addons. | ||
| + | </note> | ||
| - | ===== 2. Tasks ===== | + | ==== Plotting ==== |
| - | ==== 2.1. [20p] Set up the network simulation ==== | + | The final implementation task is to create a **dynamic** visualization interface that can show the amount of both memory reads and writes performed live, as well as the locations being accessed and the objects performing them. Note that you must provide a **fine-grained view** of each object. For example, if you decide to implement this feature as a histogram, you will have to create //multiple// buckets for each object. So if you create a micro-benchmark that follows a linear memory access pattern in the heap, your visualization tool must show how each bucket representing the heap region gets filled, one by one. |
| - | Execute the **topology.py** script with sudo privileges. Don't mess around with the script arguments just yet. Once you've obtained the ''mininet>'' prompt, open one terminal for each host. Select your preferred terminal (e.g., kitty, gnome-terminal, xterm, etc.) | + | <note tip> |
| - | + | You are free to implement this feature in any way you desire. E.g., you can pass the data to be plotted to a Python3 script that generates a [[https://matplotlib.org/stable/users/explain/figure/interactive.html|matplotlib interactive figure]]. Or you can generate an in-process frontend using [[https://github.com/ocornut/imgui|ImGui]] or [[https://www.man7.org/linux/man-pages/man3/ncurses.3x.html|ncurses]]. Or you can write an HTTP server that can accept state updates over the network and display the plots in your browser. These are just a few ideas; feel free to utilize whatever you're most comfortable with. | |
| - | <code bash> | + | |
| - | mininet> h1 kitty & | + | |
| - | mininet> h2 kitty & | + | |
| - | mininet> h3 kitty & | + | |
| - | </code> | + | |
| - | + | ||
| - | You can spawn multiple terminals on the same host. Additionally, you can even run **wireshark** if you need to debug something. | + | |
| - | On **h3**, run an **iperf3** TCP server. From **h1**, connect to that server with an **iperf3** client. What throughput did you obtain? | + | |
| - | + | ||
| - | Next, spawn another **iperf3** server on **h3**, but this time make it UDP. Start two simultaneous connections: TCP from **h1** to **h3** and UDP from **h2** to **h3** (after a few seconds). For the UDP connection, set the bandwidth to 10Mbps from **iperf3**'s command line arguments. What is the throughput of each experiment? | + | |
| - | + | ||
| - | <note warning> | + | |
| - | Do not try to do this in **wsl**. Its kernel implements network namespaces very poorly and you will have disastrous results. You can, however, solve this assignment in a VM. | + | |
| </note> | </note> | ||
| - | |||
| - | ==== 2.2. [30p] Implement connection monitoring tool ==== | ||
| - | |||
| - | Starting from **socket_diag.c**, follow the three TODOs. You will need to isolate the **iperf3** socket used for data transfers based on the source and destination IPs and ports. Additionally, you will have to ask the kernel to give you a [[https://github.com/torvalds/linux/blob/master/include/uapi/linux/tcp.h#L228|tcp_info]] structure in its reply. This structure counts as an optional attribute that you will have to extract from the reply. As you can see, it contains a large number of metrics that you can monitor. | ||
| - | |||
| - | Use this tool of yours to //continuously// monitor the **iperf3** data transfer over a TCP connection for one minute. Determine the **throughput** and **congestion window** for every tcp_info sample. Plot these values as functions of time and explain what you observe. Ask [[https://grok.com/|grok]] what each field in the tcp_info structure represents and select additional metrics that may support your hypothesis. | ||
| <note important> | <note important> | ||
| - | You may change whatever you want in **socket_diag.c**. Don't just stop after the three TODOs. | + | Small bonus available if you can limit the displayed samples to a user-specified time window. In other words, show the memory access distribution for the past **N** seconds while continuously updating the plot. Whether a sample is part of the window or not should be decided based on the time it was taken, not when you consumed it from the record ring buffer. Perf also has an option for attaching a timestamp to each record. |
| - | ---- | + | |
| - | You can choose whether to keep **setns()** or just run the program in the same network namespace as **iperf3** (i.e., from within another **h1** terminal). Just pick whatever solution seems easiest to you. | + | |
| - | ---- | + | |
| - | **iperf3** will open //two// connections to the server. The first is used to negotiate the experiment parameters and exchange final measurements. The second is used to actually transfer the data and stress test the network. You're interested in the latter, not the former. | + | |
| </note> | </note> | ||
| - | ==== 2.3. [30p] Differential analysis ==== | + | ==== Documentation ==== |
| - | Try varying the bandwidths and delays of the **h1-r1** and **h2-r1** links. Best if you keep them symmetric. Record the same metrics that you've used in your previous experiment. | + | Implementation aside, your last task is to test and document your project. Your documentation should be in PDF format and describe your design choices, what tasks you found most difficult, how you solved those problems, and how you tested your tracer. Naturally, this implies adding plots generated by tracing //multiple// benchmark programs. Explain how you chose these benchmarks and what observations you could make. |
| - | Create two figures, one for the bandwidth-varying experiment, and one for the delay-varying experiment. Create multiple plots for these experiments within the same figure and explain what impact these variations had. Just to clarify, for the //"throughput as a function of time"// figure, plot each experiment where you vary the delay with **±k * 25ms** (with k = 0, 1, 2, 3, ...) and label them accordingly. Aim for something like [[https://stackoverflow.com/questions/22276066/how-to-plot-multiple-functions-on-the-same-figure|this]]. Also, that value of 25ms is just a suggestion. | + | <note tip> |
| + | The goal of this documentation is to convince the reader of the soundness of your design and implementation. Try to pose and answer questions such as //What guarantee do we have that the sampling is uniform? Is it possible to have a burst of localized samples followed by a period of PMC inactivity?// or //How did we verify that both read and write accesses have been reported, and not just one type?//. | ||
| - | Automate the data acquisition part of this task as much as possible. Include any scripts that you've written / modified in your submission. | + | Such issues will arise naturally as you implement the assignment so don't give them much thought beforehand. But remember to address them in the end. Also, needless to say, don't limit yourselves to these examples. |
| + | </note> | ||
| - | <note tip> | + | ===== Grading ===== |
| - | These experiments that you are performing reference a few //specific// features of the TCP protocol. | + | |
| - | </note> | + | |
| - | ==== 2.4. [20p] Evaluate bpftune impact ==== | + | The deadline for this assignment is **11 May**. Upload a **zip archive** containing the source code, Makefile, documentation and any **micro-benchmarks** used in testing (don't go and include **redis** in your submission). The archive should be uploaded to this [[https://curs.upb.ro/2025/mod/assign/view.php?id=135330|moodle assignment]]. |
| - | Try running **bpftune** on your host and re-run the experiment from the first task (with the TCP and UDP simultaneous **iperf3** connections). Note what changes it makes to the system. Read the source code and try to figure out the criteria that triggered the tuner. Do these changes have any visible effect? | + | This assignment is worth **1.5p** of your final grade. The breakdown by task is as follows: |
| + | * **Memory access tracing (30%):** If nothing else, the application can provably monitor memory accesses by printing the relevant information to //stdout//. | ||
| + | * **Mapping addresses to objects (30%):** The application should be able to generate statistics for both accessed data regions and code regions performing the accesses. Reads and writes must be treated separately. | ||
| + | * **Plotting (10%):** Live illustration of the statistics mentioned in the previous task. Be creative and include even more data if you can. | ||
| + | * **Documentation (30%):** Adequately explains the design and implementation. Can convincingly prove that both are sound. Describes the testing methodology and presents the results in a //concise// but thorough manner. In other words: //"Someone has to read this so be considerate and don't waste their time. Improves your chances of not pissing them off."// | ||
| - | ===== 3. Proof of work ===== | + | <note important> |
| + | The **first pair** that submits an assignment that receives **full marks** will automatically pass the exam with a maximum grade. | ||
| + | </note> | ||
| - | Your submission must be uploaded to [[https://curs.upb.ro/2024/mod/assign/view.php?id=156520|moodle]] by the **7th of May, 11:59pm** and must contain the following: | + | ===== FAQ ===== |
| - | - A **pdf report** (max. 5 pages, negotiable) with all your observations from each task, as well as plots illustrating your experiments. Writing this report in LaTeX is recommended but not obligatory. The plots can be generated in LaTeX from raw data (which you must include). | + | |
| - | - The Netlink Socket Diagnostics tool that you've implemented and used in acquiring runtime data. | + | |
| - | - Any scripts used for automating boring / repetitive tasks. | + | |
| + | :?: | ||
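
The Plotting section asks for several buckets per object, and the bonus asks for a window over the past **N** seconds that is decided by each sample's own timestamp. The sketch below shows one possible way to keep such per-region counts in a Python3 plotting script. It is only an illustration under stated assumptions: the names `bucket_index` and `WindowedHistogram`, the `(timestamp_ns, addr, is_write)` sample layout, and the assumption that samples arrive in timestamp order are all hypothetical and not part of the assignment text.

```python
from collections import deque


def bucket_index(addr, start, end, n_buckets):
    """Map addr in [start, end) to one of n_buckets equal-width buckets."""
    if not start <= addr < end:
        raise ValueError("address outside region")
    return (addr - start) * n_buckets // (end - start)


class WindowedHistogram:
    """Read/write bucket counts for one region over the last window_ns.

    Hypothetical helper: assumes samples are added in timestamp order.
    """

    def __init__(self, start, end, n_buckets, window_ns):
        self.start, self.end = start, end
        self.n_buckets = n_buckets
        self.window_ns = window_ns
        self.samples = deque()  # (timestamp_ns, bucket, is_write)
        self.reads = [0] * n_buckets
        self.writes = [0] * n_buckets

    def add(self, timestamp_ns, addr, is_write):
        """Count one sample, using the timestamp recorded with the sample."""
        b = bucket_index(addr, self.start, self.end, self.n_buckets)
        self.samples.append((timestamp_ns, b, is_write))
        (self.writes if is_write else self.reads)[b] += 1

    def expire(self, now_ns):
        """Drop samples whose timestamp is older than now_ns - window_ns."""
        cutoff = now_ns - self.window_ns
        while self.samples and self.samples[0][0] < cutoff:
            _, b, is_write = self.samples.popleft()
            (self.writes if is_write else self.reads)[b] -= 1


if __name__ == "__main__":
    # Toy heap region [0x1000, 0x2000) split into 4 buckets, 10 ns window.
    heap = WindowedHistogram(0x1000, 0x2000, 4, 10)
    heap.add(0, 0x1000, False)
    heap.add(5, 0x1400, True)
    heap.expire(12)
    print(heap.reads, heap.writes)
```

A frontend would call `expire()` before each redraw and plot `reads` and `writes` as two separate series. If samples come from several ring buffers, they may need to be merged by timestamp first for the ordering assumption to hold.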