03. [30p] Packets, where are you?

Earlier in Ex. 1, we mentioned that eBPF is used for more than traffic filtering. Some of you may have heard of the eXpress Data Path (XDP) or the more recent eXpress Resubmission Path (XRP). Both of these are eBPF-powered shunts of kernel data paths that are used to optimize the system for very specific types of workloads. We'll return to these in a future lecture (and maybe a lab as well) since they can be considered advanced topics. For now, we'll focus on the third purpose eBPF can serve: execution tracing.

pwru is a tool created by Cilium to help trace network packets in the kernel's network stack and debug network connectivity issues. It does this by attaching simple eBPF programs to certain function entry points. These programs can report back to a userspace process different kinds of information, including the function that was reached, the arguments that were passed, and a CPU clock timestamp. The method used for instrumenting kernel code is based on kprobes. Ask your assistant for more information.

[10p] Task A — A packet's journey

Installation — build from source

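Pre-built packages are no longer maintained for most distributions, so you'll build pwru from source. You need a Go compiler and make; the pwru README also lists LLVM/clang as a build dependency, since the eBPF programs are compiled during the build.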

# Install Go if you don't have it
$ sudo apt install golang-go   # Ubuntu/Debian
# or follow https://go.dev/dl/ for the latest version
 
# Clone and build
$ git clone https://github.com/cilium/pwru.git
$ cd pwru
$ make
$ sudo mv pwru /usr/local/bin/
 
 
The build takes about a minute on first run (Go downloads dependencies). The result is a statically linked binary with no runtime dependencies.
 
**Minimum requirements** (check before running):
 
  * Linux kernel ≥ 5.5 (for BTF support): ''uname -r''
  * BTF enabled: ''ls /sys/kernel/btf/vmlinux'' — file must exist
  * ''bpf'' filesystem mounted: ''mount | grep bpf''
 
If BTF is missing, ''pwru'' will fail immediately with a clear error message.
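The three checks above can also be scripted. The following is a minimal sketch, not part of pwru: the function names are hypothetical, the 5.5 threshold is simply the one from the list above, and reading /proc/mounts stands in for the mount | grep bpf command.

```python
import os
import re

MIN_KERNEL = (5, 5)  # threshold taken from the requirements list above


def parse_kernel_release(release):
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_ok(release, minimum=MIN_KERNEL):
    """True if the release string is at least the minimum (major, minor)."""
    return parse_kernel_release(release) >= minimum


def bpf_fs_mounted(mounts_text):
    """Scan /proc/mounts-style text for a filesystem of type 'bpf'."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] == "bpf":
            return True
    return False


def preflight():
    """Run all three checks on the local machine (Linux only)."""
    with open("/proc/mounts") as f:
        mounts = f.read()
    return {
        "kernel": kernel_ok(os.uname().release),
        "btf": os.path.exists("/sys/kernel/btf/vmlinux"),
        "bpffs": bpf_fs_mounted(mounts),
    }
```

Calling preflight() returns a dictionary with one boolean per requirement, so a False entry points at the item to fix before running pwru.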

Now, trace all outgoing DNS queries to Google's public DNS resolver (8.8.8.8) and perform one using dig. Add relative timestamps to the individual trace entries to get an idea of the computational cost of each operation.

Finally, insert an iptables rule on the OUTPUT chain that drops DNS queries to 8.8.8.8 and redo the experiment. Check where the packet's path is cut short (the reason should be obvious :p).

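Ubuntu users: local DNS caching via systemd-resolved may answer your query before it ever reaches the network. If pwru shows nothing, try flushing the cache (on older releases the equivalent command is systemd-resolve --flush-caches):

$ sudo resolvectl flush-caches

or trace traffic to 127.0.0.53 (the systemd-resolved stub listener) instead, to confirm that caching is the issue.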

[20p] Task B - Interpreting the call path

Analyze the call path in the kernel network stack for the first scenario (when the packet actually made it out). Explain each step of the packet's journey.

Check out this map of the kernel subsystems, but note that the best source of information is always RTFS.

To structure your analysis, answer these questions in order:

Where does the packet originate? Which function is the first to appear in the trace? What layer of the network stack does it correspond to?
How does it reach the IP layer? Identify the transition from the socket/transport layer to the IP layer. Which function marks this boundary?
What does Netfilter do here? Identify nf_hook_slow in the trace. Which Netfilter hook point does it correspond to (refer back to Figure 1 from Task 01)?
How does it leave the machine? Identify the function responsible for handing the packet to the network device driver. What happens after this point?
What changed with the DROP rule? Compare the two traces side by side. At which function does the path diverge?

04. [20p] bpftrace

In Lab 05 you used bpftrace exclusively via one-liners (-e flag). That works fine for quick investigations, but as your probes get more complex — multiple hooks, conditionals, helper functions — you'll want to write proper script files (.bt extension).

The difference is minimal syntactically, but it is quite important in practice: a script file can have comments, be version-controlled, be shared with teammates, and be run with sudo bpftrace script.bt without the shell escaping headaches that come with one-liners.

In this task you'll write two scripts targeting functions you observed in your pwru trace from Task 03.

Before starting: make sure you have a clean iptables state. Remove any DROP rules you added in Task 03:

$ sudo iptables -D OUTPUT -p udp -d 8.8.8.8 --dport 53 -j DROP
# verify:
$ sudo iptables -L OUTPUT -n

[0p] Task A: Demo: coding style for bpftrace scripts

Before writing your own scripts, study this example. It is not a task — there is nothing to submit. It exists to show what a well-structured .bt script looks like, so you have a reference when writing the next two.

You can also find out more about the bpftrace coding style here

nf_demo.bt

#!/usr/bin/bpftrace
 
BEGIN
{
    printf("Tracing nf_hook_slow... Ctrl+C to stop.\n\n");
}
 
/*
 * fentry fires at the entry of the kernel function.
 * Faster and lower-overhead than kprobe.
 * 'comm' is a bpftrace built-in: the name of the current process.
 */
fentry:nf_hook_slow
{
    @invocations_by_process[comm]++;
}
 
/* Print and reset every 3 seconds */
interval:s:3
{
    printf("-- %s --\n", strftime("%H:%M:%S", nsecs));
    print(@invocations_by_process);
    printf("\n");
    clear(@invocations_by_process);
}
 
END
{
    printf("Done.\n");
}

Run it for a few seconds while generating some traffic (curl, dig, etc.) and observe the output. Then read through the script again. This is the style expected in Task B.

[20p] Task B: The cost of a bloated rule chain

nf_hook_slow, which is visible in your pwru trace from Task 03, is the function that walks the iptables rule chain for every packet. Its cost is not fixed: it scales with the number of rules in the chain, and within each rule, with the number of match flags specified. A rule like -p tcp -d 8.8.8.8 --dport 443 invokes three separate match callbacks in sequence; if any returns false, evaluation stops for that rule and moves on to the next one. On a long chain, this adds up.
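To make the cost model concrete, here is a toy simulation of the behaviour described above. It is only an illustration under simplifying assumptions, not the kernel's implementation: Rule, walk_chain and drop_https_to are hypothetical names, and a "match callback" is modelled as a plain Python function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Packet = Dict[str, object]
Match = Callable[[Packet], bool]


@dataclass
class Rule:
    matches: List[Match]  # evaluated in order, like the flags of a rule
    verdict: str          # e.g. "DROP"


def walk_chain(rules, packet, policy="ACCEPT"):
    """Return (verdict, number of match callbacks evaluated)."""
    evaluated = 0
    for rule in rules:
        matched = True
        for match in rule.matches:
            evaluated += 1
            if not match(packet):
                # First failing match ends evaluation of this rule
                matched = False
                break
        if matched:
            return rule.verdict, evaluated
    # No rule matched: fall through to the default policy
    return policy, evaluated


def drop_https_to(ip):
    """Toy equivalent of: -p tcp -d <ip> --dport 443 -j DROP"""
    return Rule(
        matches=[
            lambda p: p["proto"] == "tcp",
            lambda p, ip=ip: p["dst"] == ip,
            lambda p: p["dport"] == 443,
        ],
        verdict="DROP",
    )


# 5000 rules, one per (made-up) destination address
chain = [drop_https_to(f"10.0.{i // 256}.{i % 256}") for i in range(5000)]
```

In this model, a TCP packet to 8.8.8.8:443 passes the protocol match and fails the address match in every rule, so walk_chain evaluates 10000 callbacks before reaching the default policy. A UDP packet fails the first match of each rule and costs 5000 evaluations.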

A common real-world mistake: a sysadmin responds to unwanted traffic by adding one DROP rule per offending source IP, one at a time, instead of a single rule covering the entire prefix. After hours or days of this, the chain has thousands of rules. Every packet, regardless of its actual destination, must walk the entire chain before reaching the default policy. On a modest server, this is enough to cause visible throughput degradation.
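The alternative to one rule per address can be sketched with Python's standard ipaddress module, which merges individual addresses into the smallest set of covering prefixes. The helper name collapse is hypothetical and the addresses are example values from a documentation range, not data from this lab.

```python
import ipaddress


def collapse(addresses):
    """Merge individual IP addresses into the fewest covering prefixes."""
    networks = [ipaddress.ip_network(a) for a in addresses]
    return list(ipaddress.collapse_addresses(networks))


# 256 offending sources that happen to fill one /24 (example values)
offenders = [f"203.0.113.{i}" for i in range(256)]
prefixes = collapse(offenders)  # a single network: 203.0.113.0/24
```

Here 256 per-address rules shrink to one rule for 203.0.113.0/24. Note that collapse_addresses only merges addresses that exactly fill a prefix; deciding to block a whole prefix when only some of its addresses misbehave is a policy choice the script cannot make for you.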

You are going to reproduce this and measure it.

What you need

iperf3: sudo apt install iperf3
An iperf3 server — see options below
bpftrace (from previous tasks)
Python with matplotlib and pandas for the plot

Setting up a local iperf3 server

Running a local server eliminates network variability from the experiment, so the iptables overhead signal becomes much cleaner and easier to observe in the plot. Pick one of the two options below depending on your setup.

Option 1: Docker container with Arch Linux

If you have Docker installed, you can spin up an Arch Linux container that shares the host's network stack. The --network host flag means the container does not get its own network namespace and uses yours instead. Traffic to 127.0.0.1:5201 goes through the host OUTPUT chain exactly as if a local process were listening.

# Pull the Arch Linux image (first time only, ~170MB)
$ docker pull archlinux
 
# Start the container: share host network, install iperf3, start server
$ docker run -d --rm \
    --name iperf3-server \
    --network host \
    archlinux \
    sh -c "pacman -Sy --noconfirm iperf3 2>/dev/null && iperf3 -s"
 
# Wait ~15 seconds for pacman to finish, then test
$ iperf3 -c 127.0.0.1 -p 5201 -t 5
 
# Stop when done (--rm means the container is deleted automatically)
$ docker stop iperf3-server

Docker in one sentence: a container is a process running in its own isolated namespaces (filesystem, PID, network, etc.). --network host disables the network isolation specifically, so the container's iperf3 process binds to your machine's port 5201 directly with no port forwarding, no bridge, no NAT.

The pacman -Sy step runs inside the container every time it starts. Because each run creates a fresh container from the stock image (and --rm deletes it afterwards), the packages are downloaded again on every start, which takes ~15 seconds. Docker's layer caching only applies to image builds, so if this is too slow, build a local image once:

$ echo -e 'FROM archlinux\nRUN pacman -Sy --noconfirm iperf3' > Dockerfile.iperf3
$ docker build -t local/iperf3 -f Dockerfile.iperf3 .
$ docker run -d --rm --name iperf3-server --network host local/iperf3 iperf3 -s

Option 2: network namespace (no Docker required)

This creates an isolated network environment using Linux network namespaces and a virtual Ethernet pair (veth), much like Docker's default bridge networking does internally. See RL Lab 10 for a deeper dive into how this works.

# 1. Create the namespace
$ sudo ip netns add iperf3-ns
 
# 2. Create a veth pair: one end stays on the host, one goes into the namespace
$ sudo ip link add veth-host type veth peer name veth-ns
 
# 3. Move one end into the namespace
$ sudo ip link set veth-ns netns iperf3-ns
 
# 4. Configure the host-side interface
$ sudo ip addr add 10.99.0.1/24 dev veth-host
$ sudo ip link set veth-host up
 
# 5. Configure the namespace-side interface
$ sudo ip netns exec iperf3-ns ip addr add 10.99.0.2/24 dev veth-ns
$ sudo ip netns exec iperf3-ns ip link set veth-ns up
$ sudo ip netns exec iperf3-ns ip link set lo up
 
# 6. Start iperf3 server inside the namespace (background)
$ sudo ip netns exec iperf3-ns iperf3 -s -D
 
# 7. Test from the host (server is at 10.99.0.2)
$ iperf3 -c 10.99.0.2 -p 5201 -t 5

Traffic from the host to 10.99.0.2 is routed through the kernel's normal IP output path and traverses the OUTPUT chain, so a probe on nf_hook_slow fires for every packet of the test.

When done with the experiment:

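$ sudo ip netns pids iperf3-ns | xargs -r sudo kill   # stop the iperf3 daemon
$ sudo ip link delete veth-host                        # also removes its peer, veth-ns
$ sudo ip netns delete iperf3-ns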

The bpftrace script

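Based on the example provided in the demo, write a similar script yourself, using probes as you did in Lab 05. It should report the average nf_hook_slow latency in nanoseconds once every 10 seconds; the plot described below expects this data in bpf_results.txt.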

We recommend using kprobe/kretprobe instead of fentry/fexit for portability, since kprobes work on kernels without full BTF support, which some VMs lack. The instrumentation overhead is slightly higher, but negligible over a 300-second iperf3 test.

The experiment

Once your script is ready, open three terminals and run the following simultaneously:

Terminal 1: the iperf3 client, writing its output as JSON to iperf_results.json
Terminal 2: your bpftrace measurement script, with its output saved to bpf_results.txt
Terminal 3: the injection of iptables rules into the OUTPUT chain (we recommend injecting more than 5000)

Hint: use a loop to inject the rules.

iperf3 with -J produces a JSON file containing per-second interval data. Each interval entry includes sum.bits_per_second (throughput) and streams[0].snd_cwnd (TCP congestion window size in bytes).
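As a starting point for reading that file, the sketch below pulls the two fields named above out of each interval. It is hedged in several ways: extract_intervals is a hypothetical helper, the sample values are made up, using sum.end as the timestamp and 1 KB = 1024 bytes are assumptions, and a real report contains many more keys than shown.

```python
import json


def extract_intervals(report):
    """Return one dict per interval: end time (s), Mbit/s, cwnd in KB."""
    rows = []
    for interval in report["intervals"]:
        total = interval["sum"]
        stream = interval["streams"][0]
        rows.append({
            "t": total["end"],                         # seconds from start
            "mbps": total["bits_per_second"] / 1e6,    # bit/s -> Mbit/s
            "cwnd_kb": stream["snd_cwnd"] / 1024,      # bytes -> KB
        })
    return rows


# Made-up, heavily trimmed sample in the shape of an iperf3 -J report
SAMPLE = json.loads("""
{
  "intervals": [
    {"streams": [{"snd_cwnd": 131072}],
     "sum": {"start": 0.0, "end": 1.0, "bits_per_second": 9400000000}},
    {"streams": [{"snd_cwnd": 65536}],
     "sum": {"start": 1.0, "end": 2.0, "bits_per_second": 4700000000}}
  ]
}
""")

rows = extract_intervals(SAMPLE)
```

For the real experiment, replace SAMPLE with the result of json.load on iperf_results.json; the list of rows can then be handed to pandas or matplotlib directly.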

Wait for iperf3 to finish naturally (or stop it after the loop completes). Then clean up:

$ sudo iptables -F OUTPUT

The plot

Write a Python script called plot_results.py that:

  * Parses iperf_results.json and extracts, for each 1-second interval: timestamp (seconds from start), throughput in Mbit/s, and congestion window in KB
  * Parses bpf_results.txt and extracts, for each 10-second interval: the average nf_hook_slow latency in nanoseconds
  * Produces a single figure with two subplots stacked vertically, sharing the X axis (time in seconds):
    * Top subplot: throughput (Mbit/s) as a line, congestion window (KB) as a line on a secondary Y axis
    * Bottom subplot: average nf_hook_slow latency (ns) as a step plot, one point per 10s interval
  * Adds a vertical shaded region (axvspan) marking the approximate time window during which iptables rules were being injected (you can estimate this from when you started Terminal 3)
  * Saves the figure as results.png

Questions

Answer the following in your report:

At what approximate rule count (or time) does throughput begin to visibly degrade? Does congestion window change in the same way?
Is the latency increase in nf_hook_slow linear with rule count? What does this tell you about the algorithm used to walk the chain?

ep/labs/061/contents/tasks/ex3.1775506909.txt.gz · Last modified: 2026/04/06 23:21 by maria.popescu2812
