export MANPAGER='nvim +Man!'
Export the environment variable (source the shell config file) and test that it works.
Both tools in this section rely on eBPF under the hood, so it's worth a 60-second refresher before we start.
eBPF (extended Berkeley Packet Filter) is a virtual instruction set built into the Linux kernel. You write a small program, the kernel verifies it for safety (no infinite loops, no bad memory access), JIT-compiles it to native code, and attaches it to a hook point — a socket, a syscall, a kernel function entry, a network interface, etc. The program runs in kernel space without you having to write a kernel module.
Originally (1992), BPF existed only to filter network packets for tools like tcpdump: instead of copying every packet to userspace and discarding most of them there, you push a filter program into the kernel and only the matching packets cross the boundary. The “extended” part arrived in 2014 with Linux 3.18, bringing 64-bit registers, maps for sharing state with userspace, and broader JIT support.
Today eBPF is used for packet filtering (tcpdump, the iptables BPF match), system profiling (bpftrace, Netflix's FlameScope), and network policy enforcement in Kubernetes clusters (Cilium). You already used it in Lab 05 for I/O tracing. In this lab you'll see it appear in two more places: inside tcpdump's filter compiler and in the iptables BPF match module.
tcpdump is a network traffic monitoring tool. At its core, it uses libpcap, which in turn relies on a technology called the extended Berkeley Packet Filter (eBPF).
BPF was first proposed in 1992, when filtering mechanisms (and firewalls) were still a novel concept and were based on interpreters. BPF (now referred to as classic BPF, or cBPF) was the initial version of a Motorola-inspired virtual ISA (i.e.: it had no hardware implementation; think CHIP-8). eBPF is essentially still BPF, but redesigned for 64-bit architectures so that Just-In-Time (JIT) compilers have an easier time translating the code.
At first, the whole idea was to compile packet filtering programs and attach them to sockets in kernelspace. These programs would filter out packets that userspace processes would not be interested in. Consequently, this would reduce the quantity of data copied over the kernelspace/userspace boundary, only to ultimately be discarded.
Today, eBPF is used heavily for system profiling by companies such as Netflix and Facebook. Linux has had a kernel VM capable of running and statically verifying eBPF code since 2014. tcpdump is one of the few examples that still use it for its original purpose.
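To make the "virtual ISA" idea concrete, here is a toy accumulator machine in Python. This is not the real cBPF instruction encoding; the opcode names, the `jeq` semantics, and the sample packets are all made up for illustration. It only mirrors the shape of a filter program: load a packet byte, compare, return how many bytes of the packet to keep (0 means drop).

```python
# Toy accumulator-based filter VM, loosely inspired by classic BPF.
# NOT the real cBPF encoding; purely illustrative.

def run_filter(program, packet):
    """Run a list of (op, arg) instructions against a packet (bytes).
    Returns the number of bytes to keep (0 = drop)."""
    acc = 0
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "ldb":        # load one byte from the packet into the accumulator
            acc = packet[arg]
        elif op == "jeq":      # skip the next instruction if acc == arg
            if acc == arg:
                pc += 1
        elif op == "ret":      # final verdict
            return arg
        pc += 1
    return 0

# Keep only packets whose first byte is 0x45 (IPv4, no options: version=4, IHL=5).
prog = [
    ("ldb", 0),
    ("jeq", 0x45),
    ("ret", 0),         # no match -> drop
    ("ret", 0xFFFF),    # match -> keep the whole packet
]

ipv4_packet = bytes([0x45, 0x00, 0x00, 0x54])
ipv6_packet = bytes([0x60, 0x00, 0x00, 0x00])
print(run_filter(prog, ipv4_packet))  # 65535 -> keep
print(run_filter(prog, ipv6_packet))  # 0 -> drop
```

The real cBPF machine works the same way in spirit: the kernel interprets (or JIT-compiles) a short branching program per packet, and the return value tells libpcap how many bytes to copy to userspace.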
Use tcpdump to output outgoing NTP queries and incoming http(s) responses. Use the -d flag to see a disassembly of the compiled filter program.
Complete the tcpdump command in order to satisfy the following formatting requirements:
How to test:
$ ntpdate -q ro.pool.ntp.org
$ curl ocw.cs.pub.ro
List the available capture interfaces with -D. In addition to your network interfaces, you may also see a Bluetooth device or the dbus-system interface (depending on your desktop).
If you don't specify the interface with -i, the first entry in the printed list is used by default. This may not be your active network interface but instead, for example, your docker bridge.
iptables is a configuration tool for the kernel packet filter.
The system as a whole provides many functionalities that are grouped by tables: filter, nat, mangle, raw, security. If you want to alter a packet header, you place a rule in the mangle table. If you want to mask the private IP address of an internal host with the external IP address of the default gateway, you place a rule in the nat table. Depending on the table you choose, you will gain or lose access to some chains. If not specified, the default is the filter table.
Chains are basically lists of rules. The five built-in chains are PREROUTING, FORWARD, POSTROUTING, INPUT, OUTPUT. Each of these corresponds to certain locations in the network stack where packets trigger Netfilter hooks (here is the PREROUTING kernel hook as an example – not that hard to add one, right?) For a selected chain, the order in which the rules are evaluated is determined primarily by the priority of their tables and secondarily by the user's discretionary arrangement (i.e.: order in which rules are inserted).
A rule consists of two entities: a sequence of match criteria and a jump target.
The jump target represents an action to be taken. You are most likely familiar with the built-in actions such as ACCEPT or DROP. These actions decide the ultimate fate of the packet and are final (i.e.: rule iteration stops when they are invoked). However, there are also extended actions (see iptables-extensions(8)) that are not terminal verdicts and can be used for various tasks such as auditing, forced checksum recalculation or removal of Explicit Congestion Notification (ECN) bits.
The match criteria of every rule are checked to determine if the jump target is applied. The way this is designed is very elegant: every type of feature (e.g.: Layer 3 IP address vs Layer 4 port) that you can check has a match callback function defined in the kernel. If you want, you can write your own such function in a Linux Kernel Module (LKM) and thus extend the functionality of iptables (Writing Netfilter Modules with code example). However, you will need to implement a userspace shared library counterpart. When you start an iptables process, it searches in /usr/lib/xtables/ and automatically loads certain shared libraries (note: this path can be overwritten or extended using the XTABLES_LIBDIR environment variable). Each library there must do three things:
For example, each library contributes a help snippet: when iptables --help is called, its output is an amalgamation of every library's snippet. So when you want to test the efficiency of the iptables rule evaluation process, keep in mind that each rule may imply the invocation of multiple match callbacks like these.
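To see how matches and targets interact during a chain walk, here is a hypothetical Python model of rule evaluation. The function name, the rule representation, and the example predicates are all ours, not the kernel's; the point is only the control flow: every match callback of a rule must agree before its target fires, terminal targets end the walk, non-terminal ones (like LOG) let it continue, and the chain policy applies if nothing matched.

```python
# Hypothetical model of iptables rule evaluation (names are ours,
# not the kernel's). A rule = (list of match callbacks, target).

def evaluate_chain(rules, packet, policy="ACCEPT"):
    for matches, target in rules:
        # every match callback must return True for the rule to fire
        if all(m(packet) for m in matches):
            if target == "LOG":   # non-terminal target: keep walking
                continue
            return target         # terminal verdict: ACCEPT / DROP / ...
    return policy                 # fell off the end of the chain

# Example match callbacks (stand-ins for xt_tcpudp, xt_owner, ...):
is_udp = lambda p: p["proto"] == "udp"
to_dns = lambda p: p["dport"] == 53

rules = [
    ([is_udp, to_dns], "DROP"),   # like: -p udp --dport 53 -j DROP
]

print(evaluate_chain(rules, {"proto": "udp", "dport": 53}))  # DROP
print(evaluate_chain(rules, {"proto": "tcp", "dport": 80}))  # ACCEPT (policy)
```

The kernel's xtables code does the same walk, except the match callbacks live in modules such as xt_tcpudp and are invoked through function pointers for every packet traversing the chain.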
Write an iptables rule according to the following specifications:
How to test:
$ sudo curl www.google.com
$ sudo dmesg
Hint: the multiport and owner match modules.
$ man 8 iptables-extensions
Write an iptables rule according to the following specifications:
Continue appending the same rule with incremented TTL value until the DNS request goes through.
How to test:
$ dig +short fep.grid.pub.ro @8.8.8.8
$ man 8 iptables-extensions

nfbpf_compile
If you are working on Ubuntu, there is a chance that nfbpf_compile did not come with the iptables package (oh Canonical… maybe there's something in the Universe repos?).
Anyway, you can still install it manually:
$ sudo apt install libpcap-dev
$ wget https://raw.githubusercontent.com/netgroup-polito/iptables/master/utils/nfbpf_compile.c
$ gcc -o nfbpf_compile nfbpf_compile.c -lpcap
Also, use this man page rather than installing it separately.
This rule uses the TTL target, which is only valid in a certain table. If you pick the wrong one, iptables accepts your command's syntax but the kernel rejects the rule, and the only thing you see in the terminal is:
iptables: Invalid argument. Run `dmesg' for more information.
Check dmesg whenever iptables gives you “Invalid argument”. You'll find the actual error there.
This is intentional behavior: the kernel module that handles the TTL target implements a rule check callback that validates the structure received from userspace. It doesn't trust you. If something is wrong, it logs to the kernel ring buffer — so dmesg is always your first stop when debugging iptables rules.
Give an example when iptables is unable to catch a packet.
The Address Resolution Protocol (ARP) resolves layer 2 addresses (MAC) from layer 3 addresses (e.g.: IP). Normally, all hosts are compelled to reply to ARP requests, but this can be fiddled with using tools such as arptables. You can show the currently known neighbors using iproute2.
$ ip -c neigh show
# alias for iproute2 color output
alias ip='ip -c'
The Internet Control Message Protocol (ICMP) is an ancillary protocol meant mainly to report errors between hosts. Sometimes it can also be used to perform measurements (ping) or to inform network participants of better routes (Redirect Messages). There are many ICMP functionalities, most of which are now deprecated. Note that some network equipment may not be capable of understanding new and officially recognized protocols, while others may not even recognize experimental ICMP codepoints (i.e.: type=253,254) and simply drop the packet. Because ICMP can be used to stage attacks in a network, some operating systems (e.g.: Windows ≥7) went so far as to disable Echo Replies by default.
Use arp-scan to scan your local network while monitoring ARP traffic with wireshark to get a sense of what's going on. After that, use the following script to identify hosts discoverable via ARP but not ICMP.
nmap is a network exploration tool and a port scanner. Today, we will look only at a specific functionality that it shares with the traceroute utility.
Route discovery is simple in principle: IPv4 packets have a Time to Live (TTL) field that is decremented by 1 with each hop, thus ensuring a limited packet lifespan (imagine routing loops without TTL). A router checks the TTL field only when it has to forward the packet; the destination host will accept a packet even with TTL=0. If a router decrements the TTL to 0 and still needs to forward the packet, it drops the packet and issues an ICMP Time-To-Live Exceeded message to the source IP. By sending packets with incrementally larger TTL values, it is possible to obtain the IP of each router on the path (at least in theory).
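The TTL trick above can be sketched in a few lines of Python. The path and the probe function are entirely made up (real probes are UDP, ICMP or TCP packets, and replies can be lost or filtered), but the incrementing-TTL loop is the same one traceroute runs.

```python
# Sketch of traceroute's TTL mechanism over an imaginary path.
# PATH is hypothetical; the last entry is the destination.
PATH = ["10.0.0.1", "192.0.2.1", "198.51.100.7", "8.8.8.8"]

def probe(ttl):
    """Simulated probe: return (responder_ip, reached_destination)."""
    if ttl < len(PATH):
        # TTL expires at the router PATH[ttl - 1],
        # which answers with ICMP Time-To-Live Exceeded
        return PATH[ttl - 1], False
    return PATH[-1], True   # destination reached, normal reply

def traceroute():
    hops = []
    ttl = 1
    while True:
        ip, done = probe(ttl)
        hops.append(ip)     # one responder IP per TTL value
        if done:
            return hops
        ttl += 1

print(traceroute())   # lists all four hops, destination last
```

Real traceroute additionally sends several probes per TTL, handles timeouts (the familiar `* * *` lines), and can use different probe types, which is exactly where its behavior diverges from nmap's, as you are about to observe.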
With 8.8.8.8 as a target, use wireshark to view the traffic generated by both nmap and traceroute. What differences can you find in their default mode of operation?
$ sudo nmap \
    -sn `# disable port scan` \
    -Pn `# disable host discovery` \
    --traceroute `# perform traceroute` \
    8.8.8.8
$ traceroute 8.8.8.8
$ sudo snap remove nmap && sudo apt install nmap
# or keep the snap and grant it raw network access:
$ sudo snap connect nmap:network-control
If we do allow for a port scan by removing -sn (default is a TCP-based scan; use -sU for a UDP scan), this will take place before the actual traceroute. What changes does this bring?
When doing the TCP scan with nmap, you may have noticed a weird field in the TCP header: Options. Generate some TCP traffic with curl and look at the SYN packet in wireshark. What options do you see there?
Here is a quick breakdown of the more common TCP options and how they are used to overcome protocol limitations and improve throughput. Take a quick look if you want, then move on. We'll dive deeper into protocol options in the next task.
Earlier in Ex. 1, we mentioned that eBPF is used for more than traffic filtering. Some of you may have heard of the eXpress Data Path (XDP) or the more recent eXpress Resubmission Path (XRP). Both of these are eBPF-powered shunts of kernel data paths that are used to optimize the system for very specific types of workloads. We'll return to these in a future lecture (and maybe a lab as well) since they can be considered advanced topics. For now, we'll focus on the third purpose eBPF can serve: execution tracing.
pwru is a tool created by Cilium to help trace network packets in the kernel's network stack and debug network connectivity issues. It does this by attaching simple eBPF programs to certain function entry points. These programs can report back to a userspace process different kinds of information, including the function that was reached, the arguments that were passed, and a CPU clock timestamp. The method used for instrumenting kernel code is based on kprobes. Ask your assistant for more information.
Installation — build from source
Pre-built packages are no longer maintained for most distributions, so you'll build pwru from source. All you need is a Go compiler and make.
# Install Go if you don't have it
$ sudo apt install golang-go    # Ubuntu/Debian
# or follow https://go.dev/dl/ for the latest version

# Clone and build
$ git clone https://github.com/cilium/pwru.git
$ cd pwru
$ make
$ sudo mv pwru /usr/local/bin/

The build takes about a minute on first run (Go downloads dependencies). The result is a statically linked binary with no runtime dependencies.

Minimum requirements (check before running):
* Linux kernel ≥ 5.5 (for BTF support): uname -r
* BTF enabled: ls /sys/kernel/btf/vmlinux (the file must exist)
* bpf filesystem mounted: mount | grep bpf

If BTF is missing, pwru will fail immediately with a clear error message.
Now, trace all outgoing DNS queries to the Google DNS (i.e.: 8.8.8.8) and perform one using dig. Add relative timestamps to the individual trace entries, to get an idea of the computational cost of each operation.
Finally, insert an iptables rule on the OUTPUT chain that drops DNS queries to 8.8.8.8 and redo the experiment. Check where the packet's path is cut short (the reason should be obvious :p).
systemd-resolved may intercept your query before it reaches the network. If pwru shows nothing, try:
$ sudo systemd-resolve --flush-caches
or target 127.0.0.53 to confirm caching is the issue.
Analyze the call path in the kernel network stack for the first scenario (when the packet actually made it out). Explain each step of the packet's journey.
To structure your analysis, answer these questions in order:
You will notice nf_hook_slow in the trace. Which Netfilter hook point does it correspond to (refer back to Figure 1 from Task 01)?
In Lab 05 you used bpftrace exclusively via one-liners (-e flag). That works fine for quick investigations, but as your probes get more complex (multiple hooks, conditionals, helper functions) you'll want to write proper script files (.bt extension).
The difference is minimal syntactically, but it is quite important in practice: a script file can have comments, be version-controlled, be shared with teammates, and be run with sudo bpftrace script.bt without the shell escaping headaches that come with one-liners.
In this task you'll write two scripts targeting functions you observed in your pwru trace from Exercise 03.
Start from a clean iptables state. Remove any DROP rules you added in the previous exercise:
$ sudo iptables -D OUTPUT -p udp -d 8.8.8.8 --dport 53 -j DROP
Before writing your own scripts, study this example. It is not a task — there is nothing to submit. It exists to show what a well-structured .bt script looks like, so you have a reference when writing the next two.
You can also find out more about the bpftrace coding style here.
#!/usr/bin/bpftrace

BEGIN {
    printf("Tracing nf_hook_slow... Ctrl+C to stop.\n\n");
}

/* fentry fires at the entry of the kernel function.
 * Faster and lower-overhead than kprobe.
 * 'comm' is a bpftrace built-in: the name of the current process. */
fentry:nf_hook_slow {
    @invocations_by_process[comm]++;
}

/* Print and reset every 3 seconds */
interval:s:3 {
    printf("-- %s --\n", strftime("%H:%M:%S", nsecs));
    print(@invocations_by_process);
    printf("\n");
    clear(@invocations_by_process);
}

END {
    printf("Done.\n");
}
Run it for a few seconds while generating some traffic and observe the output. Then read through the script again. This is the style expected in Task B.
nf_hook_slow(), which is visible in your pwru trace from Task 03, is the function that walks the iptables rule chain for every packet. Its cost is not fixed: it scales with the number of rules in the chain and, within each rule, with the number of match flags specified. A match rule such as -p tcp -d 8.8.8.8 --dport 443 invokes three separate match callbacks in sequence; if any returns false, evaluation stops for that rule and moves on to the next one. On a long chain, this adds up.
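A back-of-the-envelope model of this cost, with made-up predicates standing in for the kernel's match callbacks, shows the per-rule short-circuiting and the linear growth with chain length directly:

```python
# Toy cost model for nf_hook_slow-style chain walking: a packet that
# matches no rule visits every rule; within a rule, match callbacks
# short-circuit on the first failure.

def callbacks_invoked(rules, packet):
    """rules: list of lists of match predicates.
    Returns how many callbacks run for this packet."""
    total = 0
    for matches in rules:
        for m in matches:
            total += 1
            if not m(packet):
                break   # short-circuit: skip this rule's remaining matches
    return total

# A rule with 3 match flags (think: -p tcp -d ... --dport ...)
# whose very first match always fails:
rule = [lambda p: False, lambda p: True, lambda p: True]

for n in (100, 1000, 3000):
    print(n, callbacks_invoked([rule] * n, {}))   # grows linearly with n
```

Even in this best case (one callback per rule before bailing out), the work is linear in the number of rules, which is exactly the effect you will measure below.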
A common real-world mistake: a sysadmin responds to unwanted traffic by adding one DROP rule per offending source IP, one at a time, instead of a single rule covering the entire prefix. After hours or days of this, the chain has thousands of rules. Every packet, regardless of its actual destination, must walk the entire chain before reaching the default policy. On a modest server, this is enough to cause visible throughput degradation.
You are going to reproduce this and measure it.
You will need matplotlib and optionally pandas for plotting. Running a local server eliminates network variability from the experiment, so the iptables overhead signal becomes much cleaner and easier to observe in the plot. Pick one of the two options below depending on your setup.
Option 1: Docker container with Arch Linux
If you have Docker installed, you can spin up an Arch Linux container. This container will use the same TCP/IP stack as the host, but will have distinct network devices, routing tables, firewall rules, etc. Any packet that leaves the container will have to pass through the network stack twice.
# start the container
host$ docker run -ti --rm archlinux

# show IP address of container and run iperf3
arch$ pacman -Sy --noconfirm iperf3 iproute2
arch$ ip -c a s
arch$ iperf3 -s

# test if it works (should have >40Gbps throughput)
host$ iperf3 -c ${container_ip} -p 5201 -t 5
Option 2: network namespace (no Docker required)
This creates an isolated network environment using Linux network namespaces and a virtual Ethernet pair (veth), exactly like Docker does internally. See RL Lab 10 for a deeper dive into how this works.
# 1. Create the namespace
$ sudo ip netns add iperf3-ns

# 2. Create a veth pair: one end stays on the host, one goes into the namespace
$ sudo ip link add veth-host type veth peer name veth-ns

# 3. Move one end into the namespace
$ sudo ip link set veth-ns netns iperf3-ns

# 4. Configure the host-side interface
$ sudo ip addr add 10.99.0.1/24 dev veth-host
$ sudo ip link set veth-host up

# 5. Configure the namespace-side interface
$ sudo ip netns exec iperf3-ns ip addr add 10.99.0.2/24 dev veth-ns
$ sudo ip netns exec iperf3-ns ip link set veth-ns up
$ sudo ip netns exec iperf3-ns ip link set lo up

# 6. Start iperf3 server inside the namespace (background)
$ sudo ip netns exec iperf3-ns iperf3 -s -D

# 7. Test from the host (server is at 10.99.0.2)
$ iperf3 -c 10.99.0.2 -p 5201 -t 5
Traffic from the host to 10.99.0.2 is routed through the kernel's normal IP output path and hits the OUTPUT chain where nf_hook_slow is instrumented correctly.
When done with the experiment:
$ sudo ip netns delete iperf3-ns
$ sudo ip link delete veth-host
Write a bpftrace script of your own that calculates the average time each packet spent being evaluated in nf_hook_slow().
Use kprobe/kretprobe instead of fentry/fexit for portability: kprobes also work on kernels without full BTF support, which some VMs lack. The instrumentation overhead is slightly higher, but overall negligible.
Run a 5-10s iperf3 throughput test between your host and the container. Meanwhile, use the script that you've written to measure the latency introduced by the OUTPUT Netfilter chain hook.
With no rules configured on your OUTPUT chain, this first measurement serves as a baseline. Next, repeat the experiment after adding 100 iptables rules that are guaranteed never to match (i.e., no verdict is reached until every rule has been evaluated). Keep adding rules in batches of 100 until you end up with ~3,000 rules in your OUTPUT chain. Save all the results (number of rules, average throughput, average Netfilter-induced latency), since you will have to plot them.
Try to script this, since manually re-running all of this is very tiresome!
$ sudo iptables -F OUTPUT
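One way to script the experiment is to generate the command sequence first and inspect it before running anything with sudo. Everything below is a suggestion: the DROP rule (towards the TEST-NET address 192.0.2.1, port 9) is just one rule that should never match your iperf3 traffic, and 10.99.0.2 assumes the namespace setup from above; adapt both to your environment.

```python
# Sketch: generate (not execute) the shell commands for one round of
# the experiment. The rule and server address are assumptions; adjust
# them, then feed the output to a shell (with sudo) yourself.

def round_commands(batch_size=100):
    # A rule guaranteed not to match iperf3 traffic: TEST-NET dst, port 9
    rule = "iptables -A OUTPUT -p tcp -d 192.0.2.1 --dport 9 -j DROP"
    cmds = [rule] * batch_size
    # measure throughput after the batch was appended
    cmds.append("iperf3 -c 10.99.0.2 -p 5201 -t 5 --json")
    return cmds

cmds = round_commands()
print(len(cmds))     # 101: 100 rules + 1 measurement
print(cmds[0])
```

Wrapping this in an outer loop (30 rounds of 100 rules, running your bpftrace script alongside each iperf3 measurement) gives you the full 0-to-3000-rule sweep without any manual re-running.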
Write a Python script that creates two plots in the same figure:
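As a starting point, here is a minimal matplotlib sketch with made-up numbers; replace the three lists with your own measurements. The Agg backend (so no display is needed) and the output filename are arbitrary choices.

```python
# Minimal two-subplot sketch; the data below is HYPOTHETICAL
# placeholder data, not real measurements.
import matplotlib
matplotlib.use("Agg")              # render to file, no display needed
import matplotlib.pyplot as plt

rules      = [0, 100, 500, 1000, 3000]          # rules in OUTPUT chain
throughput = [48.1, 47.5, 44.0, 40.2, 28.9]     # Gbps (placeholder)
latency    = [0.4, 0.9, 2.8, 5.1, 15.0]         # us/packet (placeholder)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(6, 6))

ax1.plot(rules, throughput, marker="o")
ax1.set_ylabel("iperf3 throughput [Gbps]")

ax2.plot(rules, latency, marker="o", color="tab:red")
ax2.set_ylabel("avg nf_hook_slow latency [us]")
ax2.set_xlabel("rules in OUTPUT chain")

fig.tight_layout()
fig.savefig("nf_hook_slow.png")
```

Sharing the x-axis between the two subplots makes it easy to see the throughput drop and the latency rise line up at the same rule counts.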
Answer the following:
Is the latency introduced by nf_hook_slow() linear with the rule count? What does this tell you about the algorithm used to walk the chain?

Please take a minute to fill in the feedback form for this lab.