Differences

This shows you the differences between two versions of the page.

--- ep:labs:05:contents:tasks:ex3 [2025/02/11 23:43]
cezar.craciunoiu
+++ ep:labs:05:contents:tasks:ex3 [2026/03/31 02:50] (current)
radu.mantu
@@ Line 1: / Line 1: @@
-==== 03. [30p] RAM disk ====
+==== 03. [30p] bpftrace ====
+The [[https://ebpf.io/|extended Berkeley Packet Filter (eBPF)]] is an under-represented technology in CS curricula. Its predecessor, BPF, has been around since the early 1990s, and the technology has served multiple purposes over the years. As a //tl;dr//, what you need to know about eBPF is that it's a purely virtual [[https://docs.kernel.org/6.3/bpf/instruction-set.html|instruction set]], meaning that no hardware implements it. eBPF programs can be uploaded to the kernel, where they are JIT translated to native machine code and become callable by other kernel components.
+The question is: why would we go through all this trouble instead of using a [[https://embetronicx.com/tutorials/linux/device-drivers/linux-device-driver-tutorial-part-2-first-device-driver/|Linux Kernel Module (LKM)]]? Unlike LKMs, eBPF programs have a simpler structure and can be more easily verified by the kernel. Before JIT translating them, the kernel ensures their safety by enforcing certain properties. For example, eBPF programs are //guaranteed// to finish. How is this property checked and enforced? Originally, by making sure that eBPF programs have //no back jumps//; since Linux 5.3, the verifier also accepts loops whose iteration bound it can prove. As you can imagine, this makes even writing a simple ''for'' loop a challenge.
+Initially, BPF (the **extended** part was added to the Linux kernel in 2014, bringing 64-bit registers and a larger instruction set) was used as a filtering mechanism for network packet captures, limiting the amount of data copied to a userspace process for analysis. It is still used this way today. Try running **tcpdump <expression>** with the **-d** flag added. Instead of actually listening for packets, this dumps the BPF program that **tcpdump** would otherwise compile from that expression and upload to the kernel. That program is invoked for each packet and decides whether the **tcpdump** process should receive a copy of it.
+More recently (since the mid-2010s), eBPF has been used in cloud native solutions such as [[https://cilium.io/|Cilium]] for profiling, resource observability and network policy enforcement. Technologies such as these have long been used internally by Netflix and Meta and are becoming increasingly relevant. You can find more information about this topic in [[https://isovalent.com/blog/post/cilium-up-and-running/#cilium-up-and-running-is-finally-available-and-ready-for-download|Cilium: Up and Running]], a recent book released by Isovalent, a company specialized in microservice architectures that was acquired by Cisco in 2024 to help improve their inter-cloud security technologies.
+=== [0p] Task A - Hello World ===
+[[https://man.archlinux.org/man/extra/bpftrace/bpftrace.8.en|bpftrace]] is a high-level tracing language whose scripts are compiled into eBPF programs. A script plays a role similar to a **tcpdump** expression, but it can implement more complex logic and can be used to //instrument// kernel functions. After installing the package, try running it on your system (**sudo** may be required):
+<code bash>
+$ bpftrace -e 'BEGIN { printf("hello world\n"); }'
+</code>
+A **bpftrace** script consists of one or more probes. Each probe names a //specific// hook point available in the kernel and has an action block where the acquired data can be processed (e.g., incrementing a counter). ''BEGIN'' is a special type of probe that has no corresponding symbol in the kernel; instead, it is executed once, when the program starts. This is useful for initializing global counters, for instance.
-Linux allows you to use part of your RAM as a block device, viewing it as a hard disk partition. The advantage of using a RAM disk is the **extremely low latency** (even when compared to SSDs). The disadvantage is that all contents will be lost after a reboot.
 <note tip>
-There are two main types of RAM disks:
+Moving forward, you may find it useful to keep the [[https://bpftrace.org/docs/release_025/language|bpftrace language documentation]] open in another tab.
-  * **ramfs** - cannot be limited in size and will continue to grow until you run out of RAM. Its size can not be determined precisely with tools like **df**. Instead, you have to estimate it by looking at the "cached" entry from **free**'s output.
-  * **tmpfs** - newer than **ramfs**. Can set a size limit. Behaves exactly like a hard disk partition but can't be monitored through conventional means (i.e. **iostat**). Size can be precisely estimated using **df**.
 </note>
-=== [15p] Task A - Create RAM Disk ===
-Before getting started, let's find out the file system that our root partition uses. Run the following command (T - print file system type, h - human readable):
+=== [5p] Task B - Trace read() syscalls ===
+By running ''bpftrace -l'', we get a list of //all// available probes. Each probe name is a sequence of terms separated by '':''. The first term defines the probe type, while the final term is the actual probe name. Here is a list of probe types, to give you an idea of what can be monitored with **bpftrace**:
+  * **kprobe:** Attaches to any place inside a kernel function in a manner similar to breakpoints in **gdb**.
+  * **fentry:** Attaches to the entry of a kernel function. Safer and faster than kprobes.
+  * **tracepoint:** Developer-placed hooks with user-friendly, structured arguments to inspect.
+  * **rawtracepoint:** Faster than tracepoints but provides raw arguments. Requires more knowledge of what you're monitoring.
+  * **hardware:** Hooks into CPU performance counters (remember the PMC task in the CPU monitoring lab).
+  * **software:** Subscribes to software-generated perf events. Yes, you can do **perf** sampling based on number of TCP packets sent, not just cache misses.
+  * **iter:** Iterates over kernel data structures. Not event driven and still experimental due to locking restrictions for eBPF.
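+The naming scheme described above (probe type first, probe name last) can be taken apart with plain shell parameter expansion. This is only an illustrative sketch; the variable names ''probe'', ''probe_type'' and ''probe_name'' are made up for this example and are not part of **bpftrace**.

```shell
# One probe name, as listed by `bpftrace -l`.
probe='tracepoint:syscalls:sys_enter_read'

# Hypothetical variable names, chosen for this sketch only.
probe_type=${probe%%:*}   # everything before the first ':'
probe_name=${probe##*:}   # everything after the last ':'

echo "type=$probe_type name=$probe_name"
```

+The same expansions work on every line of the ''bpftrace -l'' output, for example when grouping the available probes by type.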
+For this task, we are going to attach a probe to the ''sys_enter_read'' tracepoint and print the process name for each invocation:
 <code bash>
-$ df -Th
+# NOTE: you can shorten "tracepoint" to just "t"
+$ bpftrace -e 'tracepoint:syscalls:sys_enter_read { printf("%s\n", comm); }'
 </code>
-The result should look like this:
-<code>
+Notice how we use the built-in variable ''comm'' that automatically resolves to the [[https://elixir.bootlin.com/linux/v6.19.10/source/include/linux/sched.h#L1174|executable name]] to find out what process performed a [[https://man.archlinux.org/man/read.2|read()]] syscall.
-Filesystem     Type      Size  Used Avail Use% Mounted on
-udev           devtmpfs  1.1G     0  1.1G   0% /dev
+=== [10p] Task C - Filter read() syscalls ===
-tmpfs          tmpfs     214M  3.8M  210M   2% /run
-/dev/sda1      ext4      218G  4.1G  202G   2% / <- root partition
+For this task, try to modify the previous one-liner to only print the ''comm'' and ''pid'' of the processes that have performed an invalid **read()** syscall (i.e., the return value is negative). For this, you will have to use the ''args'' built-in to access the return value. Note however that the return value is not available in the //entry// hook, but instead only in the //exit// hook.
-tmpfs          tmpfs     1.1G  252K  1.1G   1% /dev/shm
-tmpfs          tmpfs     5.0M  4.0K  5.0M   1% /run/lock
+What errno codes have been returned? What do these errors mean?
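+At the syscall level, Linux reports a failure by returning the negated errno code, so negating a negative ''ret'' value yields the errno number. The sketch below shows that conversion for a few codes from the generic Linux errno table. ''decode_ret'' is a hypothetical helper written for this example and the list is deliberately incomplete; on a real system, prefer a tool such as **errno** or the [[https://man.archlinux.org/man/errno.3|errno(3)]] manual page.

```shell
# decode_ret is a hypothetical helper, not part of bpftrace.
# It maps a negative syscall return value to a Linux errno name.
# Only a handful of codes from the generic errno table are covered.
decode_ret() {
    case $(( 0 - $1 )) in
        4)  echo EINTR  ;;   # interrupted by a signal
        9)  echo EBADF  ;;   # bad file descriptor
        11) echo EAGAIN ;;   # no data available right now (non-blocking fd)
        14) echo EFAULT ;;   # buffer outside the accessible address space
        21) echo EISDIR ;;   # fd refers to a directory
        22) echo EINVAL ;;   # invalid argument
        *)  echo "unknown ($1)" ;;
    esac
}

decode_ret -11
decode_ret -9
```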
-tmpfs          tmpfs     1.1G     0  1.1G   0% /sys/fs/cgroup
-/dev/sda2      ext4      923M   73M  787M   9% /boot
+<note>
-/dev/sda4      ext4      266G   62M  253G   1% /home
+You can use ''bpftrace -lv'' to get a detailed description of the ''args'' attributes that are available. For example:
+<code bash>
+$ bpftrace -lv sys_enter_read
+    tracepoint:syscalls:sys_enter_read
+        int __syscall_nr
+        unsigned int fd
+        char * buf
+        size_t count
+$ bpftrace -lv sys_exit_read
+    tracepoint:syscalls:sys_exit_read
+        int __syscall_nr
+        long ret
 </code>
+</note>
-From the results, we will assume in the following commands that the file system is **ext4**. If it's not your case, just replace with what you have:
+<note tip>
+There are two methods of specifying filters:
+  * An [[https://bpftrace.org/docs/release_025/language#conditionals|if statement]] inside the action block of the probe.
+  * A [[https://bpftrace.org/docs/release_025/language#filterspredicates|predicate]] specified between the probe name and the action block.
+</note>
+<solution -hidden>
 <code bash>
-$ sudo mkdir /mnt/ramdisk
+$ sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args.ret < 0/ { printf("%s (%d): %d\n", comm, pid, args.ret) }'
-$ sudo mount -t tmpfs -o size=1G ext4 /mnt/ramdisk
+    kitty (8441): -11
+    kitty (8441): -11
+    libinput-connec (1357): -11
+$ errno 11
+    EAGAIN 11 Resource temporarily unavailable
 </code>
+</solution>
-<note>
+=== [10p] Task D - Count read bytes ===
-If you want the RAM disk to persist after a reboot, you can add the following line to ///etc/fstab//. Remember that its contents will still be lost.
+In this task, we are going to count how much data each application has read while our **bpftrace** script has been running. For this, we are going to be using [[https://ebpf.hamza-megahed.com/docs/chapter2/2-maps/|eBPF maps]]. These maps consist of shared memory between the JIT translated programs that are resident in kernel space and the user applications that need to collect the data that the probes gather. In this case, the user space application is the **bpftrace** program.
+In the **bpftrace** scripting language, maps are identified by a unique name prefixed by the ''@'' symbol. Optionally, the map name can be followed by a key in ''[...]'', effectively turning the map into a hash map. You can use these maps without declaring them in a ''BEGIN'' block, unless you want to initialize them with non-zero values. For example, incrementing the amount of data read on a per-application basis can be as simple as:
 <code>
-tmpfs     /mnt/ramdisk     tmpfs     rw,nodev,nosuid,size=1G     0  0
+@bytes_read[comm] += args.ret
 </code>
-</note>
-That's it. We just created a 1Gb **tmpfs** ramdisk with an **ext4** file system and mounted it at ///mnt/ramdisk//. Use **df** again to check this yourself.
+Make sure you filter out negative return values and execute your **bpftrace** script. Let it run for a few seconds, then interrupt it via a SIGINT (i.e., //Ctrl + C//). When unloading the probes and before terminating the process, all maps will be printed to //stdout//.
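+To make the semantics of the keyed map concrete, the per-key accumulation can be emulated in user space with **awk**. This is only a sketch of what the filter plus ''+='' compute, not how **bpftrace** implements maps; the sample data and the ''sum_reads'' function name are invented for this example.

```shell
# Hypothetical (comm, return value) pairs, as a sys_exit_read probe might see them.
samples='cat 100
kitty -11
cat 200
bash 1
kitty 64'

# Emulates: keep only positive return values, then add them up per comm.
# sum_reads is a made-up name for this sketch.
sum_reads() {
    awk '$2 > 0 { total[$1] += $2 } END { for (c in total) print c, total[c] }' | sort
}

printf '%s\n' "$samples" | sum_reads
```

+The failed read (''-11'') is dropped by the filter, and the two ''cat'' reads are summed under a single key.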
-=== [15p] Task B - Pipe View & RAM Disk ===
+<solution -hidden>
+<code bash>
+$ sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args.ret > 0/ { @read_bytes[comm] += args.ret }'
+</code>
+</solution>
-As we mentioned before, you can't get I/O statistics regarding **tmpfs** since it is not a real partition. One solution to this problem is using **pv** to monitor the progress of data transfer through a pipe. This is a valid approach only if we consider the disk I/O being the bottleneck.
+== Periodic statistics ==
-Next, we will generate 512Mb of random data and place it in ///mnt/ramdisk/file// first and then in ///home/student/file//. The transfer is done using **dd** with 2048-byte blocks.
+Let's say you want to display these statistics every 2 seconds and reset the counters after each print, making the output feel more like **vmstat**.
+Use the [[https://bpftrace.org/docs/release_025/language#interval|interval]] probe to achieve this. You can ''print()'' the map and then ''clear()'' it to reset its contents.
+<solution -hidden>
 <code bash>
-$ pv /dev/urandom | dd of=/mnt/ramdisk/rand  bs=2048 count=$((512 * 1024 * 1024 / 2048))
+$ sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args.ret > 0/ { @read_bytes[comm] += args.ret } interval:s:2 { print(@read_bytes); printf("\n"); clear(@read_bytes) }'
-$ pv /dev/urandom | dd of=/home/student/rand bs=2048 count=$((512 * 1024 * 1024 / 2048))
 </code>
+</solution>
-Look at the elapsed time and average transfer speed. What conclusion can you draw?
+=== [5p] Task E - Built-in histogram function ===
-:!: Put one screenshot with the tmpfs partition in df output and one screenshot of both pv commands and write your conclusion.
+Use the [[https://bpftrace.org/docs/release_025/stdlib#hist|hist()]] built-in function of **bpftrace** to visualize the distribution of the values returned by **read()**. Each bar does not show the total number of bytes read, but how many **read()** calls returned a value that falls within that specific log2 bucket.
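+The log2 bucketing can be sketched in shell as follows. ''hist_bucket'' is a hypothetical helper written for this example, and it prints raw numbers instead of the ''K'' suffixes that **bpftrace** uses for larger buckets.

```shell
# hist_bucket is a made-up helper, not part of bpftrace.
# It prints the log2 bucket label that a signed integer falls into.
# The function body is a subshell, so its variables do not leak.
hist_bucket() (
    n=$1
    if [ "$n" -lt 0 ]; then
        echo "(..., 0)"          # all negative values share one bucket
    elif [ "$n" -lt 2 ]; then
        echo "[$n]"              # 0 and 1 each get their own bucket
    else
        lo=2
        # Double the lower bound while the next power of two still fits.
        while [ $((lo * 2)) -le "$n" ]; do
            lo=$((lo * 2))
        done
        hi=$((lo * 2))
        echo "[$lo, $hi)"
    fi
)

hist_bucket -11
hist_bucket 1
hist_bucket 5
hist_bucket 4096
```

+A **read()** that returns 4096 bytes lands in the ''[4096, 8192)'' bucket, which **bpftrace** labels ''[4K, 8K)''.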
+<solution -hidden>
+<code bash>
+$ sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read { @ = hist(args.ret) }'
+    @:
+    (..., 0)             303 |@@                                                  |
+    [0]                  531 |@@@@                                                |
+    [1]                 6246 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
+    [2, 4)                10 |                                                    |
+    [4, 8)                44 |                                                    |
+    [8, 16)              725 |@@@@@@                                              |
+    [16, 32)              22 |                                                    |
+    [32, 64)             401 |@@@                                                 |
+    [64, 128)             32 |                                                    |
+    [128, 256)            55 |                                                    |
+    [256, 512)            85 |                                                    |
+    [512, 1K)            347 |@@                                                  |
+    [1K, 2K)             104 |                                                    |
+    [2K, 4K)             279 |@@                                                  |
+    [4K, 8K)              38 |                                                    |
+    [8K, 16K)             15 |                                                    |
+    [16K, 32K)            10 |                                                    |
+    [32K, 64K)             1 |                                                    |
+</code>
+</solution>

ep/labs/05/contents/tasks/ex3.1739310218.txt.gz · Last modified: 2025/02/11 23:43 by cezar.craciunoiu
