

04. [30p] bpftrace

In Lab 05 you used bpftrace exclusively via one-liners (-e flag). That works fine for quick investigations, but as your probes get more complex — multiple hooks, conditionals, helper functions — you'll want to write proper script files (.bt extension).

The difference is minimal syntactically, but it is quite important in practice: a script file can have comments, be version-controlled, be shared with teammates, and be run with sudo bpftrace script.bt without the shell escaping headaches that come with one-liners.

In this task you'll write two scripts targeting functions you observed in your pwru trace from Task 03.

Before starting: make sure you have a clean iptables state. Remove any DROP rules you added in Task 03:

$ sudo iptables -D OUTPUT -p udp -d 8.8.8.8 --dport 53 -j DROP
# verify:
$ sudo iptables -L OUTPUT -n

[0p] Task A: Demo: coding style for bpftrace scripts

Before writing your own scripts, study this example. It is not a task — there is nothing to submit. It exists to show what a well-structured .bt script looks like, so you have a reference when writing the next two.

You can also find out more about the bpftrace coding style here

nf_demo.bt
#!/usr/bin/bpftrace
 
BEGIN
{
    printf("Tracing nf_hook_slow... Ctrl+C to stop.\n\n");
}
 
/*
 * fentry fires at the entry of the kernel function.
 * Faster and lower-overhead than kprobe.
 * 'comm' is a bpftrace built-in: the name of the current process.
 */
fentry:nf_hook_slow
{
    @invocations_by_process[comm]++;
}
 
/* Print and reset every 3 seconds */
interval:s:3
{
    printf("-- %s --\n", strftime("%H:%M:%S", nsecs));
    print(@invocations_by_process);
    printf("\n");
    clear(@invocations_by_process);
}
 
END
{
    printf("Done.\n");
}

Run it for a few seconds while generating some traffic (curl, dig, etc.) and observe the output. Then read through the script again. This is the style expected in Task B.

[30p] Task B: The cost of a bloated rule chain

nf_hook_slow, which is visible in your pwru trace from Task 03, is the function that walks the iptables rule chain for every packet. Its cost is not fixed: it scales with the number of rules in the chain and, within each rule, with the number of match flags specified. A rule like -p tcp -d 8.8.8.8 --dport 443 invokes three separate match callbacks in sequence; if any returns false, evaluation of that rule stops and moves on to the next one. On a long chain, this adds up.

A common real-world mistake: a sysadmin responds to unwanted traffic by adding one DROP rule per offending source IP, one at a time, instead of a single rule covering the entire prefix. After hours or days of this, the chain has thousands of rules. Every packet, regardless of its actual destination, must walk the entire chain before reaching the default policy. On a modest server, this is enough to cause visible throughput degradation.

You are going to reproduce this and measure it.

What you need

  • iperf3: sudo apt install iperf3
  • An iperf3 server — see options below
  • bpftrace (from previous tasks)
  • Python with matplotlib and pandas for the plot

Setting up a local iperf3 server

Running a local server eliminates network variability from the experiment; the iptables overhead signal becomes much cleaner and easier to observe in the plot. Pick one of the two options below depending on your setup.

Option 1: Docker container with Arch Linux

If you have Docker installed, you can spin up an Arch Linux container. This container will use the same TCP/IP stack as the host, but will have distinct network devices, routing tables, firewall rules, etc. Any packet that leaves the container will have to pass through the network stack twice.

# start the container
host$ docker run -ti --rm archlinux
 
# show IP address of container and run iperf3
arch$ pacman -Sy --noconfirm iperf3 iproute2
arch$ ip -c a s
arch$ iperf3 -s
 
# test if it works (should have >40Gbps throughput)
host$ iperf3 -c ${container_ip} -p 5201 -t 5

Option 2: network namespace (no Docker required)

This creates an isolated network environment using Linux network namespaces and a virtual Ethernet pair (veth), exactly like Docker does internally. See RL Lab 10 for a deeper dive into how this works.

# 1. Create the namespace
$ sudo ip netns add iperf3-ns
 
# 2. Create a veth pair: one end stays on the host, one goes into the namespace
$ sudo ip link add veth-host type veth peer name veth-ns
 
# 3. Move one end into the namespace
$ sudo ip link set veth-ns netns iperf3-ns
 
# 4. Configure the host-side interface
$ sudo ip addr add 10.99.0.1/24 dev veth-host
$ sudo ip link set veth-host up
 
# 5. Configure the namespace-side interface
$ sudo ip netns exec iperf3-ns ip addr add 10.99.0.2/24 dev veth-ns
$ sudo ip netns exec iperf3-ns ip link set veth-ns up
$ sudo ip netns exec iperf3-ns ip link set lo up
 
# 6. Start iperf3 server inside the namespace (background)
$ sudo ip netns exec iperf3-ns iperf3 -s -D
 
# 7. Test from the host (server is at 10.99.0.2)
$ iperf3 -c 10.99.0.2 -p 5201 -t 5

Traffic from the host to 10.99.0.2 is routed through the kernel's normal IP output path and hits the OUTPUT chain, which is exactly where your nf_hook_slow instrumentation will see it.

When done with the experiment:

$ sudo ip netns delete iperf3-ns
$ sudo ip link delete veth-host

The bpftrace script

Based on the demo script above, write a similar script of your own, using probes as you did in Lab 05. It should measure the latency of each nf_hook_slow call and print the average every 10 seconds.

We recommend using kprobe/kretprobe instead of fentry/fexit for portability, since kprobes work on kernels without full BTF support, which some VMs lack. The instrumentation overhead is slightly higher, but negligible alongside a 300-second iperf3 test.
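As a starting point, here is one possible skeleton (not a complete solution) that times each nf_hook_slow call with a kprobe/kretprobe pair and prints a running average every 10 seconds. You still need to run it with sudo and redirect its output to bpf_results.txt.

```
#!/usr/bin/bpftrace

/* Timestamp the entry of every nf_hook_slow call, keyed by thread. */
kprobe:nf_hook_slow
{
    @start[tid] = nsecs;
}

/* On return, fold the elapsed time into a running average. */
kretprobe:nf_hook_slow
/@start[tid]/
{
    @avg_ns = avg(nsecs - @start[tid]);
    delete(@start[tid]);
}

/* Print and reset the average every 10 seconds. */
interval:s:10
{
    printf("%s ", strftime("%H:%M:%S", nsecs));
    print(@avg_ns);
    clear(@avg_ns);
}
```

The /@start[tid]/ filter guards against the kretprobe firing for a call whose entry we did not see (e.g. one that started before the script attached).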

The experiment

Once your script is ready, open three terminals and run the following simultaneously:

  • Terminal 1: iperf3 against your local server for 300 seconds, writing per-second JSON output (-J) to iperf_results.json
  • Terminal 2: your bpftrace script, with its output redirected to bpf_results.txt
  • Terminal 3: a loop that injects iptables DROP rules into the OUTPUT chain (we recommend injecting more than 5000)

Hint: repetitive structures
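One way to generate thousands of distinct rules is a shell loop that derives a unique source address from the loop counter. The sketch below is a dry run: it only writes the commands to rules.txt so you can inspect them first. Remove the leading echo (or pipe the file through sh) to actually inject the rules; the 10.x.y.z prefix is an arbitrary choice.

```shell
#!/bin/sh
# Dry-run sketch: generate N distinct DROP rules, one per source /32.
# Remove "echo" to actually inject them (requires root).
N=5000
i=1
while [ "$i" -le "$N" ]; do
    # spread the counter across the last three address bytes
    a=$(( (i >> 16) & 255 ))
    b=$(( (i >> 8) & 255 ))
    c=$(( i & 255 ))
    echo sudo iptables -A OUTPUT -s "10.$a.$b.$c/32" -j DROP
    i=$(( i + 1 ))
done > rules.txt
wc -l rules.txt
```

Note that none of these rules match your iperf3 traffic (the server is at 10.99.0.2 or the container IP), so nothing gets dropped; every packet simply has to walk past all of them.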

iperf3 with -J produces a JSON file containing per-second interval data. Each interval entry includes sum.bits_per_second (throughput) and streams[0].snd_cwnd (TCP congestion window size in bytes).
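Extracting those fields in Python is straightforward. The sketch below uses a tiny synthetic sample with the same shape as iperf3 -J output (a real file has one entry per second and many more fields); swap the sample for json.load() on your iperf_results.json.

```python
import json

# Minimal synthetic sample mimicking the iperf3 -J schema; a real run
# would use: data = json.load(open("iperf_results.json"))
sample = json.loads("""
{"intervals": [
  {"sum": {"start": 0, "bits_per_second": 4.2e10},
   "streams": [{"snd_cwnd": 3145728}]}
]}
""")

rows = [
    {
        "t": iv["sum"]["start"],                         # seconds from start
        "mbps": iv["sum"]["bits_per_second"] / 1e6,      # throughput, Mbit/s
        "cwnd_kb": iv["streams"][0]["snd_cwnd"] / 1024,  # cwnd, KB
    }
    for iv in sample["intervals"]
]
print(rows[0])
```

A list of such dicts can be handed directly to pandas.DataFrame(rows) for the plotting step.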

Wait for iperf3 to finish naturally (or stop it after the loop completes). Then clean up:

$ sudo iptables -F OUTPUT

The plot

Write a Python script called plot_results.py that:

  1. Parses iperf_results.json and extracts, for each 1-second interval: timestamp (seconds from start), throughput in Mbit/s, and congestion window in KB
  2. Parses bpf_results.txt and extracts, for each 10-second interval: the average nf_hook_slow latency in nanoseconds
  3. Produces a single figure with two subplots stacked vertically, sharing the X axis (time in seconds):
    • Top subplot: throughput (Mbit/s) as a line, congestion window (KB) as a line on a secondary Y axis
    • Bottom subplot: average nf_hook_slow latency (ns) as a step plot, one point per 10s interval
  4. Adds a vertical shaded region (axvspan) marking the approximate time window during which iptables rules were being injected (you can estimate this from when you started Terminal 3)
  5. Saves the figure as results.png
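The plotting side can be sketched as follows. The data here is synthetic stand-in data (replace it with the values parsed from iperf_results.json and bpf_results.txt), and the shaded window at 100-160s is just an example; adjust it to when you actually ran the injection loop.

```python
import matplotlib
matplotlib.use("Agg")          # render without a display
import matplotlib.pyplot as plt

# Synthetic stand-in data; replace with your parsed measurements.
t = list(range(300))
mbps = [40000 if s < 120 else 15000 for s in t]
cwnd_kb = [3000 if s < 120 else 900 for s in t]
lat_t = list(range(0, 300, 10))
lat_ns = [2000 if s < 120 else 90000 for s in lat_t]

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))

# Top: throughput, with cwnd on a secondary Y axis
ax1.plot(t, mbps, label="throughput")
ax1b = ax1.twinx()
ax1b.plot(t, cwnd_kb, color="tab:orange", label="snd_cwnd")
ax1.set_ylabel("Mbit/s")
ax1b.set_ylabel("cwnd [KB]")

# Bottom: average nf_hook_slow latency, one step per 10s interval
ax2.step(lat_t, lat_ns, where="post")
ax2.set_ylabel("avg latency [ns]")
ax2.set_xlabel("time [s]")

# Shaded region marking the rule-injection window (example values)
for ax in (ax1, ax2):
    ax.axvspan(100, 160, color="red", alpha=0.15)

fig.tight_layout()
fig.savefig("results.png")
```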

Questions

Answer the following in your report:

  1. At what approximate rule count (or time) does throughput begin to visibly degrade? Does congestion window change in the same way?
  2. Is the latency increase in nf_hook_slow linear with rule count? What does this tell you about the algorithm used to walk the chain?