Kernel development

Tasks

01. [??p] Prerequisites

[??p] Task A - Dependencies

Before we get down to it, there are a few packages that need to be installed. The following commands should work on Ubuntu 20.04 Desktop. While most of these probably came with the base system, we'll enumerate them here for the sake of completeness. If you're using another distribution, most of these should have an equivalent package (probably with an identical name).

[student@host]$ sudo apt update
[student@host]$ sudo apt install curl make gcc git iptables dnsutils qemu-system debootstrap libxtables-dev

In case you're wondering what any of these are:

  • curl: CLI tool for fetching (HTTP) data from servers
  • make: project build automation tool
  • gcc: GNU C compiler
  • git: CLI tool for content tracking & versioning
  • iptables: user space interface to the kernel's packet filter
  • dnsutils: contains dig which we'll need to send out DNS queries
  • qemu-system: PC system emulator; if you don't have space for all of them, install only qemu-system-x86 or -arm
  • debootstrap: user space environment bootstrapping tool
  • libxtables-dev: development files for the packet filtering framework; needed for “xtables.h”

Moving forward, we prepared a code skeleton for you: skeleton.zip. Its structure is as follows:

[student@host]$ tree -L 2 skeleton 
skeleton/                     --> root of workspace
├── 02/                        --> task 2: basically a "Hello World"
│   ├── Kbuild
│   ├── Makefile
│   ├── my_first_module.c
│   └── patches/
├── 03/                        --> task 3: an iptables extension
│   ├── include/
│   ├── module/
│   └── plugin/
└── images/                    --> task 1: VM hard disk images (empty)
 
7 directories, 3 files

At this time, there's no point in taking a more detailed look at each file. Instead, let us focus on the task at hand and prepare our testing environment.

[??p] Task B - Development environment

When developing new features for the kernel, chances are that you will screw up. Often. Depending on the severity, the kernel may or may not recover. So to avoid restarting your PC over and over, it's better to work in a minimal virtualized environment. As such, we will first bootstrap a loopback disk image with a basic Ubuntu system, but without the kernel. Eventually, we will boot a virtual machine with qemu-system-x86_64 from this disk image, with a custom kernel that we will build ourselves.

The bootstrapping and kernel building process may take a few (~15) minutes. Feel free to jump to Exercise 2 and come back once in a while to check on the progress. While Task A of that exercise is doable on your live kernel, make sure to stop there. Its Task B is meant to generate errors and should be solved in the VM. For your sake :p

Bootstrapping

For the bootstrapping process, we will use debootstrap. This tool will download a Debian-based ecosystem and install it in whatever directory we tell it to. Incidentally, that directory will be the mount point of the disk image that we are going to create.

# create a 5GB empty file -- this will be our disk image
[student@host]$ qemu-img create images/ubuntu.raw 5G
 
# build an ext4 filesystem onto the disk image -- now we can mount it
[student@host]$ mkfs.ext4 images/ubuntu.raw
 
# mount the ext4 filesystem -- now we can copy files onto it
[student@host]$ sudo mount images/ubuntu.raw /mnt
 
# bootstrap the Ubuntu system
[student@host]$ sudo debootstrap --arch amd64 focal /mnt http://archive.ubuntu.com/ubuntu

Almost there… if you list the contents of /mnt/, you will see most of the usual entries from your root directory. At this point, we should be able to boot into this machine (if we had a kernel image), but we wouldn't be able to log in. The only thing that's left for us to do is set a password for the root user. For this, we need to trick the passwd tool into thinking that /mnt/ is in fact the root of our filesystem. So we use chroot:

# pretend that /mnt/ is our new / and start a bash instance inside
[student@host]$ sudo chroot /mnt /bin/bash
 
# change password for current user (root)
[   root@jail]$ passwd
New password: root
Retype new password: root
passwd: password updated successfully
 
# while we're here, set Google DNS as primary DNS
# qemu has a bug where it refuses to fall back to other resolvers
# so it will be hard stuck on 127.0.0.53 ==> can't resolve domain names
[   root@jail]$ echo 'nameserver 8.8.8.8' > /etc/resolv.conf
 
# exit from this bash instance and escape from the chroot jail
[   root@jail]$ exit
 
# finally, unmount our disk -- we're done with it for now
[student@host]$ sudo umount /mnt

Kernel building

The next step is to get the kernel source code and compile it. By separating the kernel from the disk image, we are able to check out other branches / commits and test out different versions without installing them anywhere. Normally, you would have to select what options you want included in the compilation process (e.g.: memory allocators, cryptographic systems, etc.) by running make menuconfig. After finishing your selection and saving the configuration, a .config file would be created. Because we haven't the faintest idea what most of the things enumerated in that menu even are, we will rely on default configurations. One thing that is not part of the default configuration and will be useful later is debug symbols. Go through the following commands and refer to the GIF below when you are required to navigate the config menu.

# clone the linux kernel locally
[student@host]$ git clone --depth=1 https://github.com/torvalds/linux.git
 
# go into the repo directory
[student@host]$ pushd linux
 
# create a default configuration file (.config)
[student@host]$ make x86_64_defconfig kvm_guest.config
 
# manually enable debug symbols on top of current .config
# NOTE: refer to the GIF below
[student@host]$ make menuconfig
 
# optional: check out the generated .config file
[student@host]$ less .config
 
# compile the kernel using all cores
[student@host]$ make -j $(nproc)
 
# return to the previous directory
[student@host]$ popd

[GIF: enabling debug symbols in the make menuconfig menu]

Booting up the virtual machine

We are finally here. Let's boot up the VM from our bootstrapped disk image, with our personally compiled Linux kernel.

[student@host]$ sudo qemu-system-x86_64                            \
                  -m 4G                                            \
                  -smp 1                                           \
                  -enable-kvm                                      \
                  -kernel linux/arch/x86/boot/bzImage              \
                  -drive file=images/ubuntu.raw,format=raw,index=0 \
                  -append 'root=/dev/sda rw console=ttyS0 nokaslr' \
                  -nographic

Let us have a look at this command, line by line:

  1. -m 4G: allocate 4GB of memory (change this as you wish)
  2. -smp 1: use only 1 vCPU; this is recommended for debugging purposes
  3. -enable-kvm: KVM is a Linux kernel module that transforms your operating system into a bare-metal hypervisor. This is what allows you to run actual virtual machines on Linux. Without it, qemu would try to emulate the system, resulting in worse performance.
  4. -kernel …/bzImage: this specifies the compiled & compressed kernel image to use when booting the virtual machine
  5. -drive … : specifies the disk image to load; note that index=0 will make the VM consider this to be /dev/sda. Adding another drive with index=1 will cause it to be regarded as /dev/sdb.
  6. -append …: these are command line arguments for the kernel (yes, even it has those). root=/dev/sda rw marks /dev/sda (i.e.: our ubuntu.raw disk image) as the root device to be mounted onto the root directory (i.e.: /) in read-write mode. console=ttyS0 tells Linux to use the first serial port (a UART that qemu exposes to the VM) for I/O. nokaslr tells the kernel to disable kernel address space layout randomization, which keeps symbol addresses predictable while debugging.
  7. -nographic: tells qemu not to open a separate window for the GUI. Instead, it will take the virtual serial device (which the VM will recognize as ttyS0) and link it to the terminal. So whatever the VM sends via the serial port to be printed will end up in your stdout. Whatever you type into stdin will be forwarded to the VM as input.

If you have problems with the VM booting and you can't <Ctrl-C> out of it, try <Ctrl+A X> to signal qemu that it is time to exit. Note that if you feel something odd happening with your terminal (e.g.: overlapping lines), you can run reset.

Under normal circumstances, exit the VM by running poweroff.

After starting the VM and logging in as root (with the password that was set earlier), try finding out the kernel version in both the host and guest operating systems:

# host has the latest Arch Linux kernel (you may have Ubuntu, etc.)
[student@host]$ uname -r
5.15.2-arch1-1
 
# guest has the newest Linux kernel release candidate
[  root@guest]$ uname -r
5.16.0-rc2+

Note how we did not specify a network device to qemu. By default, SLiRP is used to provide network connectivity. If you've never heard of SLiRP, don't feel bad. It's a program that emulates the Point-to-Point Protocol (PPP) over shell accounts and became largely obsolete once dedicated dial-up PPP connections and broadband became widely available (I kid you not). While it does provide TCP and UDP connectivity, note that ICMP packets will be dropped and your VM will not be discoverable; not even from your host.

The correct way of doing things would be creating a bridge (i.e.: a software layer-2 switch) with brctl, adding a network device to your VM via the -netdev flag, and attaching it to the newly created bridge. This is a bit overkill for our purpose today. If you ever need to create such a setup, there are plenty of resources available.


Although we said that you should have network access in your VM, there may be a chance that you don't have an IP address assigned. You may need to do this manually:

# list available interfaces (in a colorful fashion)
[root@guest]$ ip -c addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
 
# send a DHCP request on your Ethernet interface
[root@guest]$ dhclient enp0s3
 
# check if an IP address was allocated
[root@guest]$ ip -c addr show enp0s3
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 86313sec preferred_lft 86313sec
    inet6 fec0::5054:ff:fe12:3456/64 scope site dynamic mngtmpaddr 
       valid_lft 86318sec preferred_lft 14318sec
    inet6 fe80::5054:ff:fe12:3456/64 scope link 
       valid_lft forever preferred_lft forever

02. [??p] Kernel modules

Way back when, all kernel functionality had to be compiled into a single image, meaning that adding new functionality required recompiling and installing the kernel, followed by a reboot. Today, things are much easier. By using the kmod tools (man 8 kmod), users are allowed to load and unload modules (i.e.: kernel object files) on demand, without all the fuss. These modules are written in C and must implement initialization and removal functions that are called automatically on load and unload. Usually, these functions register / unregister other functions contained in your object with core kernel systems.

We can use lsmod to get a list of all present modules, and modinfo to obtain detailed information about a specific module.

[student@host]$ lsmod
ecdh_generic           16384  1 bluetooth
 
[student@host]$ modinfo ecdh_generic | grep description
description:    ECDH generic algorithm
 
[student@host]$ modinfo bluetooth | grep description 
description:    Bluetooth Core ver 2.22

What we can understand from this is that the Elliptic Curve Diffie-Hellman module is 16384 bytes in size and is used by one other module, bluetooth, via its ECDH helper. As you probably noticed, elixir.bootlin.com is a critical resource for navigating the kernel code.

[??p] Task A - Our first module

Looking in the skeleton/02/ directory from our code skeleton, we will find a minimal build environment for our first module. Alas, compiling a kernel module differs from compiling a user space program. But just slightly: kernel-specific headers must be used, user space-specific libraries (e.g.: libc) are generally unavailable (so no printf()) and lastly, the same options that were used to compile the kernel itself must be specified. To this end, the kbuild system was introduced. As you can see, our Makefile invokes its counterpart from the kernel source directory in /lib/modules/..., which in turn uses the configuration in our Kbuild file. The obj-m variable specifies the name of the final output object file (in this case, test.o). test-objs contains the list of object files that are linked into it, so if you split your code across multiple sources, just add them to test-objs. If you have a single source, you can drop test-objs but the kbuild system will expect a test.c file to be present.

Now, let's compile our module, load it into the kernel, and see what happens:

[student@host]$ make
 
[student@host]$ sudo insmod test.ko
[student@host]$ sudo dmesg
...
[ 6348.461247] my-first-module: Hello world!
 
[student@host]$ sudo rmmod test
[student@host]$ sudo dmesg
...
[ 6348.461247] my-first-module: Hello world!
[ 6366.635090] my-first-module: Goodbye cruel, cruel world!

Here, we used insmod to load a .ko kernel object file into the kernel proper and rmmod to remove it. dmesg is a tool that prints the kernel message buffer. Note that there are multiple log levels ranging from debug to emergency. pr_info() is the kernel's printf() variant that corresponds to one of the less urgent levels. dmesg can be configured to squelch messages under a certain level but depending on how your kernel was compiled, some of the more important messages will also be echoed to your terminal.

[??p] Task B - Debugging (1)

In this task we are going to add a bug to our initial module. We will do this by applying a diff patch to our source:

[student@host]$ patch my_first_module.c patches/add_bug.patch

Now, our module has a 50% chance to dereference a NULL pointer every time we try to load it. If this happens, a kernel oops will occur. While no error is truly harmless, an oops is less severe than a kernel panic. The difference between the two is that the system can recover from a kernel oops, but not from a kernel panic. The Windows equivalent of a kernel panic would be a Blue Screen Of Death.

Knowing that our module will cause trouble, we should test it inside the VM. In order to do this, we need to recompile it using the Makefile in the Linux repo that we cloned. For this, we override the KDIR variable used in our module's Makefile.

# clean up previously created objects
[student@host]$ make clean
 
# recompile the module, but for the kernel used in the VM; not your live kernel
[student@host]$ KDIR=$(realpath ../linux/) make

Now, we need to get test.ko onto the VM. First of all, if it's still running, shut it down. Next, we are going to once again mount the disk image and copy the kernel object into root's home directory. Doing so on a live partition might be a bit trickier :p

# stop the VM if it's still running
[  root@guest]$ poweroff
 
# once again, mount the VM disk image
[student@host]$ sudo mount ../images/ubuntu.raw /mnt
 
# copy the module in the VM's root home
[student@host]$ sudo cp test.ko /mnt/root
 
# unmount the disk before starting the VM again
[student@host]$ sudo umount /mnt

Finally, start up qemu once again and notice that test.ko is in /root/. Try to load it with insmod until you get an error like this:

root@victim:~# insmod test.ko
[   26.083587] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   26.084413] #PF: supervisor write access in kernel mode
[   26.085044] #PF: error_code(0x0002) - not-present page
[   26.085663] PGD 0 P4D 0
[   26.085972] Oops: 0002 [#1] PREEMPT SMP PTI
[   26.086475] CPU: 0 PID: 212 Comm: insmod Tainted: G           O      5.16.0-rc2+ #1
[   26.087385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[   26.088487] RIP: 0010:init+0x3f/0x70 [test]
[   26.089000] Code: 24 08 31 c0 48 8d 7c 24 07 e8 7d dd dd d2 0f b6 74 24 07 48 c7 c7 00 d0 34 c0 e8 74 0
[   26.091188] RSP: 0018:ffff96c8c01cfde0 EFLAGS: 00010282
[   26.091813] RAX: 0000000000000023 RBX: 0000000000000000 RCX: 0000000000000000
[   26.092656] RDX: 0000000000000000 RSI: ffffffff940390f9 RDI: 00000000ffffffff
[   26.093496] RBP: ffffffffc034c000 R08: ffffffff94335c88 R09: 00000000ffffdfff
[   26.094344] R10: ffffffff94255ca0 R11: ffffffff94255ca0 R12: 0000000000000000
[   26.095185] R13: ffff899ac4cbe4a0 R14: 0000000000000003 R15: 0000000000000000
[   26.096044] FS:  00007f00a9916540(0000) GS:ffff899afbc00000(0000) knlGS:0000000000000000
[   26.097013] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.097701] CR2: 0000000000000000 CR3: 00000001027ba000 CR4: 00000000000006f0
[   26.098545] Call Trace:
[   26.098860]  <TASK>
[   26.099125]  do_one_initcall+0x3f/0x1e0
[   26.099608]  ? kmem_cache_alloc_trace+0x3a/0x1b0
[   26.100164]  do_init_module+0x56/0x240
[   26.100617]  __do_sys_finit_module+0xa0/0xe0
[   26.101139]  do_syscall_64+0x3b/0x90
[   26.101595]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   26.102224] RIP: 0033:0x7f00a9a5b70d
[   26.102658] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 8
[   26.104854] RSP: 002b:00007ffd249707f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   26.105751] RAX: ffffffffffffffda RBX: 000055a5530f3490 RCX: 00007f00a9a5b70d
[   26.106608] RDX: 0000000000000000 RSI: 000055a55286a358 RDI: 0000000000000003
[   26.107441] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f00a9b2f260
[   26.108290] R10: 0000000000000003 R11: 0000000000000246 R12: 000055a55286a358
[   26.109155] R13: 0000000000000000 R14: 000055a5530f2400 R15: 0000000000000000
[   26.110005]  </TASK>
[   26.110274] Modules linked in: test(O+)
[   26.110733] CR2: 0000000000000000
[   26.111178] ---[ end trace d28d04e4e0f18a50 ]---

This info dump may be intimidating at first sight, but it contains all the necessary information to identify the problem:

  • BUG: kernel NULL pointer dereference, address: 0000000000000000: the reason behind the oops.
  • #PF: supervisor write access in kernel mode: when dereferencing the virtual address 0x00, the MMU tried to find the corresponding physical page address, but failed. Remember that #PF stands for Page Fault.
  • RIP: 0010:init+0x3f/0x70 [test]: the faulting instruction was located in the test module, at an offset of 0x3f from the start of the init() function, which has a total size of 0x70 bytes.

Based on this information (especially the last part), we have a few ways of identifying the exact line of code and instruction where the module crashed. First up is addr2line. This tool can convert an address to a source file line number, provided that the binary was compiled with debug symbols. We already know that the instruction was located at an offset of 0x3f from the init() function, but where was this function located relative to the beginning of the object? This can be easily discovered by consulting its symbol table with readelf.

# where is init() located relative to the start of the object file?
[student@host]$ readelf --symbols test.ko
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    ...
    24: 0000000000000000   102 FUNC    LOCAL  DEFAULT    1 init
    ...
# apparently right at the very start ==> our instruction is at address 0x00 + 0x3f = 0x3f
 
# what line from what source file generated the instruction at address 0x3f?
[student@host]$ addr2line --exe test.ko 0x3f
/.../my_first_module.c:26

Another way of identifying not only the source code line, but also the instruction is by using a tool that may be familiar to you: objdump. This is a binary file disassembler. Next, we are going to disassemble (-d) only the .text section (a.k.a. the code section), displaying the instruction mnemonics in Intel syntax (-M intel) and interleaving the C code that generated these instructions (-S).

# looking for that elusive 3f offset...
[student@host]$ objdump -d -M intel -S test.ko
 ...
    /* we have a 50-50 chance to shoot ourselves in the foot */
    if (random & 0x80) {
  34:   80 7c 24 07 00          cmp    BYTE PTR [rsp+0x7],0x0
  39:   0f 89 00 00 00 00       jns    3f <init_module+0x3f>
        *((uint8_t *) NULL) = 0xff;
  3f:   c6 04 25 00 00 00 00    mov    BYTE PTR ds:0x0,0xff
  46:   ff 
    } else {
 ...

Let's say a module generates an oops. Even if the kernel recovers, that module will be locked in place until reboot. If you try to rmmod it, the kernel will claim that it's still in use. For our example:

# is our module still loaded?
[root@guest]$ lsmod | grep test
test                   16384  1
 
# can we remove the module?
[root@guest]$ rmmod test
rmmod: ERROR: Module test is in use
 
# looks like the module crashed while in the "Loading" state
# the kernel was trying to load it at address 0xffffffffc0304000
[root@guest]$ cat /proc/modules
test 20480 1 - Loading 0xffffffffc0304000 (O+)

In this case, the best course of action is to simply reboot, but if you want to risk it, just run rmmod -f to force unloading the module. Be warned that this is very dangerous. Also, if the module is still there after using the -f flag, make sure that CONFIG_MODULE_FORCE_UNLOAD was set at compile time.

# check if force unloading the module is an option
[root@guest]$ zcat /proc/config.gz | grep CONFIG_MODULE_FORCE_UNLOAD
CONFIG_MODULE_FORCE_UNLOAD=y
 
# force unload the kernel module
[root@guest]$ rmmod -f test

03. [??p] Extending the Linux firewall

Pro tip: since you may want to consult the man pages at some point, add this to your .bashrc or .zshrc:

# color schemes for man pages
man() {
    LESS_TERMCAP_mb=$'\e[1;34m'   \
    LESS_TERMCAP_md=$'\e[1;32m'   \
    LESS_TERMCAP_so=$'\e[1;33m'   \
    LESS_TERMCAP_us=$'\e[1;4;31m' \
    LESS_TERMCAP_me=$'\e[0m'      \
    LESS_TERMCAP_se=$'\e[0m'      \
    LESS_TERMCAP_ue=$'\e[0m'      \
    command man "$@"
}

Now, source your file to load the new command. man will color certain keywords appropriately.

# update your shell's environment with the man() wrapper
[student@host]$ source ~/.bashrc
 
# check out the manual page for iptables
[student@host]$ man iptables

iptables is a configuration tool for the kernel packet filter.

The system as a whole provides many functionalities that are grouped by tables: filter, nat, mangle, raw, security. If you want to alter a packet header, you place a rule in the mangle table. If you want to mask the private IP address of an internal host with the external IP address of the default gateway, you place a rule in the nat table. Depending on the table you choose, you will gain or lose access to some chains. If not specified, the default is the filter table.

Chains are basically lists of rules. The five built-in chains are PREROUTING, FORWARD, POSTROUTING, INPUT, OUTPUT. Each of these corresponds to certain locations in the network stack where packets trigger Netfilter hooks (here is the PREROUTING kernel hook as an example – not that hard to add one, right?) For a selected chain, the order in which the rules are evaluated is determined primarily by the priority of their tables and secondarily by the user's discretionary arrangement (i.e.: order in which rules are inserted).

A rule consists of two entities: a sequence of match criteria and a jump target.

The jump target represents an action to be taken. You are most likely familiar with the built-in actions such as ACCEPT or DROP. These actions decide the ultimate fate of the packet and are final (i.e.: rule iteration stops when these are invoked). However, there are also extended actions (see man 8 iptables-extensions) that are not terminal verdicts and can be used for various tasks such as auditing, forced checksum recalculation or removal of Explicit Congestion Notification (ECN) bits.

The match criteria of every rule are checked to determine if the jump target is applied. The way this is designed is very elegant: every type of feature (e.g.: layer 3 IP address vs layer 4 port) that you can check has a match callback function defined in the kernel. If you want, you can write your own such function in a Linux Kernel Module (LKM) and thus extend the functionality of iptables (Writing Netfilter Modules with code example). However, you will also need to implement a userspace shared library counterpart. When you start an iptables process, it searches in /usr/lib/xtables/ and automatically loads certain shared libraries (note: this path can be overridden or extended using the XTABLES_LIBDIR environment variable). Each library there must do three things:

  • define iptables flags for the new criteria that you want to include.
  • define help messages for when iptables --help is called (its help message is an amalgamation of each library's help snippet).
  • provide an initialization function for the structure containing the rule parameters; this structure will end up in the kernel's rule chain.

So when you want to test the efficiency of the iptables rule evaluation process, keep in mind that each rule may imply the invocation of multiple callbacks such as this.

[??p] Task A - Primer / Reminder

Before writing our own match module, here's a small task to refresh your memory on how to use iptables.

Write an iptables rule according to the following specifications:

  • chain: OUTPUT
  • match rule: TCP packets originating from ephemeral ports bound to a socket created by root
  • target: enable kernel logging of matched packets with the “TCP_LOG: ” prefix

How to test:

$ sudo curl www.google.com
$ sudo dmesg

Hint: look at the multiport and owner modules.

$ man 8 iptables-extensions

[??p] Task B - Writing an xtables module

xtables is the backbone of iptables and provides a protocol-agnostic infrastructure for adding match modules. We touched on this topic earlier in the exercise but now we're going to jump right in. Following this brief introduction will be three subsections detailing the data structures, the user space shared library and the kernel module. All these are partially implemented in the 03/ directory and you will have specific TODOs in the source files. By the end, you will have implemented a match filter for requested domains in DNS queries (i.e.: you can block DNS queries for “google.com” but not “kernel.org”, for example). Note that you can solve this task either in the VM, or on your host. We won't be doing anything dangerous, like overwriting kernel structures.

If you're not that familiar with DNS, refer to this primer (Sec. 1-3).

The header

We mentioned before that an iptables extension has two components: a kernel module implementing the verification of a rule and a user space library that is able to parse the user's rules (in iptables “syntax”). Considering that our module will be named xt_dns_name (the xt_ part conforming to the naming convention), what links these two elements is the include/xt_dns_name.h header. In it, the xt_dns_name_mtinfo structure will hold all the information the kernel module needs to match a packet. This structure will be initialized in user space and transferred over to the xtables framework by iptables. Looking closer, we notice two fields. While name will hold the queried domain name, flags specifies what features are enabled for verification. Our module is simple and has only one match criterion, but note that iptables can invert a selection by specifying the ! symbol before it. So we can match a packet either on a queried domain name match, or a mismatch.

xt_dns_name.h
#ifndef _XT_DNS_NAME_H
#define _XT_DNS_NAME_H
 
/* defines enabled properties to be checked */
enum {
    XT_DNS_NAME     = 1 << 0,   /* search for DNS name match    */
    XT_DNS_NAME_INV = 1 << 1,   /* search for DNS name mismatch */
};
 
/* rule match information */
struct xt_dns_name_mtinfo {
    __u8 flags;
    __u8 name[127];
};
 
#endif /* _XT_DNS_NAME_H */

The __u8 type is equivalent to uint8_t; the kernel's u8 is a typedef of it. The kernel defines u8, u16, etc. because C types have different sizes, depending on the underlying CPU architecture. __u8 is used here instead of u8 to indicate that the header is shared with userspace. Also, the reason why the kernel doesn't simply use uint8_t is that u8 predates stdint.h.

The user space iptables plugin

The plugin consists of a shared library compiled from plugin/libxt_dns_name.c. The source is broken down into four sections:

  • API: These are prototypes of functions that must be made available to iptables to invoke when it needs certain things done. Here is a description of each function:
    • dns_name_mt_help(): When you invoke iptables --help, each plugin will print its own help message. This is our contribution.
    • dns_name_mt_init(): Called before the argument parsing begins. Will zero out our xt_dns_name_mtinfo structure.
    • dns_name_mt_parse(): Based on optarg, this function will be invoked for each argument that has something to do with our plugin. It will update the xt_dns_name_mtinfo structure accordingly on each pass.
    • dns_name_mt_check(): Final check before sending the structure to kernel space. Will verify that all required arguments were provided.
    • dns_name_mt_print(): When invoking iptables -L, this function will print out a rule's match criteria.
    • dns_name_mt_save(): Given a xt_dns_name_mtinfo structure, this function will print out the CLI arguments that would generate such a structure. Called upon by iptables-save.
  • MODULE SPECIFICATION STRUCTURES: Here we declare two structures:
    • dns_name_mt_opts : A structure defining the CLI arguments that we accept (see man 3 getopt for details).
    • dns_name_mt_reg : A structure containing function pointers for specific tasks that the module needs to accomplish. We initialize it with the functions described in the API section. This kind of structure is usually called a virtual table (or vtable).
  • IPTABLES MODULE CALLBACKS: this is where we actually implement the functions that we added to the vtable. Here, you will search for TODOs.
  • LIBRARY MANAGEMENT FUNCTIONS: When loaded, each library (i.e.: a .so file) can have a number of constructors defined. These constructors are functions called upon by the loader once the library is mapped in virtual memory. Our constructor will invoke xtables_register_match() in order to register the vtable with iptables, letting it know that it has yet another plugin at its disposal.

libxt_dns_name.c
#include <stdio.h>
#include <stdint.h>
#include <getopt.h>
#include <string.h>
#include <xtables.h>
 
#include "xt_dns_name.h"
 
/******************************************************************************
 ************************************ API *************************************
 ******************************************************************************/
 
static void dns_name_mt_help(void);
static void dns_name_mt_init(struct xt_entry_match *match);
static int  dns_name_mt_parse(int c, char **argv, int invert,
                              unsigned int *flags, const void *entry,
                              struct xt_entry_match **match);
static void dns_name_mt_check(unsigned int flags);
static void dns_name_mt_print(const void *entry,
                              const struct xt_entry_match *match, int numeric);
static void dns_name_mt_save(const void *entry,
                             const struct xt_entry_match *match);
 
/******************************************************************************
 *********************** MODULE SPECIFICATION STRUCTURES **********************
 ******************************************************************************/
 
/* module specific options */
const struct option dns_name_mt_opts[] = {
    { .name="domain", .has_arg=required_argument, .val='1' },
    { NULL },
};
 
/* module userspace extension vtable */
static struct xtables_match dns_name_mt_reg = {
    .version        = XTABLES_VERSION,
    .name           = "dns_name",
    .revision       = 0,
    .family         = NFPROTO_IPV4,
    .size           = XT_ALIGN(sizeof(struct xt_dns_name_mtinfo)),
    .userspacesize  = XT_ALIGN(sizeof(struct xt_dns_name_mtinfo)),
    .help           = dns_name_mt_help,
    .init           = dns_name_mt_init,
    .parse          = dns_name_mt_parse,
    .final_check    = dns_name_mt_check,
    .print          = dns_name_mt_print,
    .save           = dns_name_mt_save,
    .extra_opts     = dns_name_mt_opts,
};
 
/******************************************************************************
 ************************* IPTABLES MODULE CALLBACKS **************************
 ******************************************************************************/
 
/* dns_name_mt_help - prints help message for this module
 */
static void
dns_name_mt_help(void)
{
    printf("dns_name match options\n"
           "[!] --domain <string>\t\tQueried domain name.\n");
}
 
/* dns_name_mt_init - initializes our data struct fields before parsing
 *  @match : pointer to our data struct
 */
static void
dns_name_mt_init(struct xt_entry_match *match)
{
    /* match is an internal structure; we are only interested in data */
    struct xt_dns_name_mtinfo *info = (void *)match->data;
 
    /* zero out structure */
    memset(info, 0, sizeof(*info));
}
 
/* dns_name_mt_parse - called for each module-specific argument
 *  @c      : option id (see .val in dns_name_mt_opts)
 *  @argv   : argv (simple as that)
 *  @invert : 1 if user specified "!" before argument
 *  @flags  : for parser's discretionary use
 *  @entry  : ptr to an ipt_entry struct (don't care)
 *  @match  : contains pointer to our data struct (data field)
 *
 *  @return : true if option was parsed, false otherwise
 */
static int
dns_name_mt_parse(int c, char **argv, int invert, unsigned int *flags,
                  const void *entry, struct xt_entry_match **match)
{
    /* get reference to our xt_dns_name_mtinfo instance */
    struct xt_dns_name_mtinfo *info = (void *)(*match)->data;
 
    /* option-specific parsing */
    switch (c) {
        case '1':       /* --domain */
            /* check for multiple occurrences */
            if (*flags & XT_DNS_NAME)
                xtables_error(PARAMETER_PROBLEM, "xt_dns_name: "
                    "use \"--domain\" only once!");
 
            /* update parser flags and match criteria flags */
            *flags      |= XT_DNS_NAME;
            info->flags |= XT_DNS_NAME;
 
            /* check for match rule inversion */
            if (invert)
                info->flags |= XT_DNS_NAME_INV;
 
            /* initialize info->name                                     *
             * NOTE: argument is in global variable <optarg>             *
             * NOTE: convert the "." characters according to QNAME specs */
 
            /* TODO 1: initialize info->name */
 
            return true;
    }
 
    /* unknown option */
    return false;
}
 
/* dns_name_mt_check - verify that all required options were processed
 *  @flags : the persistent <flags> argument from dns_name_mt_parse()
 */
static void
dns_name_mt_check(unsigned int flags)
{
    if (!(flags & XT_DNS_NAME))
        xtables_error(PARAMETER_PROBLEM, "xt_dns_name: "
            "make sure to specify the \"--domain\" argument!");
}
 
/* dns_name_mt_print - print the match criteria fields for `iptables -L`
 *  @entry   : internal stuff (don't care)
 *  @match   : contains pointer to our data struct (data field)
 *  @numeric : do not resolve IP addresses to host names if true (don't care)
 */
static void
dns_name_mt_print(const void *entry, const struct xt_entry_match *match,
                  int numeric)
{
    const struct xt_dns_name_mtinfo *info = (void *) match->data;
 
    /* check for match rule reversal */
    if (info->flags & XT_DNS_NAME_INV)
        printf("! ");
 
    /* print domain name                                  *
     * NOTE: replace length of labels with "." characters *
     * NOTE: do NOT print a "\n" character                */
 
    /* TODO 2: print info->name */
}
 
/* dns_name_mt_save - print out arguments that generate this rule
 *  @entry : internal stuff (don't care)
 *  @match : contains pointer to our data struct (data field)
 */
static void
dns_name_mt_save(const void *entry, const struct xt_entry_match *match)
{
    const struct xt_dns_name_mtinfo *info = (void *) match->data;
 
    /* check for match rule reversal */
    if (info->flags & XT_DNS_NAME_INV)
        printf("! ");
 
    /* print "--domain" and the argument */
    printf("--domain ");
 
    /* TODO 3: copy paste TODO 2 here */
}
 
/******************************************************************************
 ************************ LIBRARY MANAGEMENT FUNCTIONS ************************
 ******************************************************************************/
 
/* _init - iptables library constructor
 *
 * NOTE: the '_init' symbol is expanded as a macro by iptables
 */
static void
_init(void)
{
    xtables_register_match(&dns_name_mt_reg);
}

When solving some of the TODOs and consulting the DNS format specification mentioned earlier, it helps to have a DNS capture open in Wireshark.
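
As an illustration of the QNAME conversion mentioned in the parser comments, the sketch below encodes a dotted name into length-prefixed labels, as described in RFC 1035. The function name `qname_encode` and its `out` / `out_len` parameters are assumptions made for this example; they are not part of the skeleton, and the way you store the result in info->name depends on how that field is declared in xt_dns_name.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode a dotted domain name ("a.bc") into DNS QNAME wire format
 * ("\1a\2bc\0"). Returns the number of bytes written, including the
 * terminating zero-length label, or -1 on malformed input or lack of space. */
static int qname_encode(const char *domain, uint8_t *out, size_t out_len)
{
    size_t pos = 0;                     /* next free byte in out */

    while (*domain != '\0') {
        /* length of current label = distance to the next '.' or the end */
        size_t label_len = strcspn(domain, ".");

        /* RFC 1035: a label is between 1 and 63 bytes long */
        if (label_len == 0 || label_len > 63)
            return -1;
        /* room for: length byte + label + final zero byte */
        if (pos + 1 + label_len + 1 > out_len)
            return -1;

        out[pos++] = (uint8_t)label_len;
        memcpy(out + pos, domain, label_len);
        pos += label_len;

        domain += label_len;
        if (*domain == '.')
            domain++;                   /* skip separator (a trailing dot is accepted) */
    }

    if (pos + 1 > out_len)
        return -1;
    out[pos++] = 0;                     /* zero-length root label ends the name */
    return (int)pos;
}
```

Printing the name back (as needed for iptables -L) is the reverse walk: read a length byte, print that many characters, and emit a "." between labels until the zero byte is reached.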

The kernel space xtables module

The kernel module source is organized similarly to the user space plugin. The dns_name_mt_reg structure acts as a vtable but also includes information about the permissible chains and the layer 3 protocols that work with our implementation. Specifically, any rule that makes use of this module can be inserted only in the OUTPUT chain, meaning that we can only catch requests originating from the local host. Moreover, we implement support only for IPv4, not for IPv6. As we can see, this structure is used on module initialization, in dns_name_mt_init(), to register our module with the xtables framework via xt_register_match().

dns_name_check() and dns_name_mt() implement the functionality required of our module. The former checks each newly inserted rule, or at least the part that pertains to this module. For example, it must make sure that a valid domain name was supplied (i.e.: each "." replaced with the length of the label that follows, etc.). The latter is called to verify whether a packet matches a certain rule. Its first argument is not the packet itself, but a socket buffer structure that contains the packet in addition to much, much more. We made sure to provide you with pointers to our xt_dns_name_mtinfo structure, and also to the beginning of the IPv4 header. However, it is up to you to implement this logic and obtain a working match module.
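
The kind of validation that a rule check could perform can be sketched in plain user space C as follows. The helper name `qname_is_valid` and its parameters are hypothetical, and rejecting an empty name is a choice made for this example. Inside the kernel you would use the kernel's own integer types and return a negative error code (e.g.: -EINVAL) from the check callback when validation fails.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QNAME_MAX 255   /* RFC 1035: maximum length of an encoded name */

/* Returns true if buf[0..buf_len) starts with a well-formed, uncompressed
 * QNAME: one or more length-prefixed labels followed by a zero byte. */
static bool qname_is_valid(const uint8_t *buf, size_t buf_len)
{
    size_t pos = 0;

    while (pos < buf_len) {
        uint8_t label_len = buf[pos];

        if (label_len == 0)
            return pos > 0;             /* terminator; reject an empty name */
        if (label_len > 63)
            return false;               /* too long, or a compression pointer */

        pos += 1 + (size_t)label_len;   /* jump to the next length byte */
        if (pos + 1 > QNAME_MAX)
            return false;               /* no room left for the terminator */
    }

    return false;                       /* ran off the buffer: no terminator */
}
```

Note that the loop never reads past buf_len, which matters because the data comes from user space and cannot be trusted.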

xt_dns_name.c
#include <linux/kernel.h>
#include <linux/netfilter/x_tables.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/module.h>
 
#include "xt_dns_name.h"
 
MODULE_DESCRIPTION("Xtables: DNS query QNAME matching");
MODULE_AUTHOR("Student");
MODULE_LICENSE("GPL");
MODULE_ALIAS("ipt_dns_name");
 
#define MOD_TAG "xt_dns_name: "
 
/******************************************************************************
 ************************************ API *************************************
 ******************************************************************************/
 
static int  dns_name_check(const struct xt_mtchk_param *par);
static bool dns_name_mt(const struct sk_buff *skb, struct xt_action_param *par);
 
/******************************************************************************
 *********************** MODULE SPECIFICATION STRUCTURES **********************
 ******************************************************************************/
 
/* registration information */
static struct xt_match dns_name_mt_reg __read_mostly = {
    .name       = "dns_name",
    .revision   = 0,
    .family     = NFPROTO_IPV4,
    .matchsize  = sizeof(struct xt_dns_name_mtinfo),
    .checkentry = dns_name_check,
    .match      = dns_name_mt,
    .hooks      = 1 << NF_INET_LOCAL_OUT,
    .me         = THIS_MODULE,
};
 
/******************************************************************************
 ************************** XTABLES MODULE CALLBACKS **************************
 ******************************************************************************/
 
/* dns_name_check - checks rule validity
 *  @par : parameters for match extensions
 *
 *  @return : 0 if everything is ok, !0 otherwise
 */
static int
dns_name_check(const struct xt_mtchk_param *par)
{
    const struct xt_dns_name_mtinfo *info = par->matchinfo;
 
    /* TODO 4: userspace is not to be trusted! check inserted rule */
 
    return 0;
}
 
/* dns_name_mt - performs packet match check
 *  @skb : packet buffer information
 *  @par : parameters for matches / targets
 *
 *  @return : true if matched, false otherwise
 */
static bool
dns_name_mt(const struct sk_buff *skb, struct xt_action_param *par)
{
    const struct xt_dns_name_mtinfo *info = par->matchinfo;
    struct iphdr                    *iph  = ip_hdr(skb);
 
    /* TODO 5: be 100% sure that the packet is a DNS request */
 
    /* TODO 6: match check on any & all QNAMEs in request */
 
    return false;
}
 
 
/******************************************************************************
 *********************** MODULE ENTRY & EXIT CALLBACKS ************************
 ******************************************************************************/
 
static int dns_name_mt_init(void)
{
    pr_info(MOD_TAG "loading xt_dns_name module\n");
    return xt_register_match(&dns_name_mt_reg);
}
 
static void dns_name_mt_exit(void)
{
    pr_info(MOD_TAG "unloading xt_dns_name module\n");
    xt_unregister_match(&dns_name_mt_reg);
}
 
module_init(dns_name_mt_init);
module_exit(dns_name_mt_exit);

Testing your solution:

[student@host]$ sudo insmod xt_dns_name.ko
 
# depending on your distro, libxt_*.so may be installed in different places
[student@host]$ sudo XTABLES_LIBDIR="$(pwd):/usr/lib/xtables:/usr/lib/x86_64-linux-gnu/xtables" \
                  iptables                                                                      \
                    -m dns_name                                                                 \
                    -I OUTPUT                                                                   \
                    --domain 'fep.grid.pub.ro'                                                  \
                    -j DROP
[student@host]$ dig +short fep.grid.pub.ro @8.8.8.8
 
[student@host]$ sudo iptables -F OUTPUT
[student@host]$ sudo rmmod xt_dns_name

Try running a Wireshark instance and filter by “dns” to make sure no queries pass through.
Note that some distros come with a DNS cache preinstalled. If the name is already cached, no query is sent out and your filtering rule never gets a chance to match.

Make sure to add plenty of pr_info() in your match function to make debugging easier.


Remember that although your host is most likely little-endian, the Internet is big-endian. So when accessing data that is larger than 1 byte (e.g.: port number), convert it with the ntohs() / htons() family of functions. They should be readily available to you in the kernel module.
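
The byte order conversion can be tried out in user space first. The sketch below parses a raw IPv4 packet by hand and checks whether it looks like a DNS query sent over UDP. The helper names `read_be16` and `is_udp_dns_query` are hypothetical, and the sketch deliberately ignores IP fragments and DNS over TCP; in the kernel module you would work with the header pointers obtained from the socket buffer instead of raw offsets.

```c
#include <arpa/inet.h>   /* ntohs() in user space */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 16-bit network order (big-endian) field and return it in host order. */
static uint16_t read_be16(const uint8_t *p)
{
    uint16_t raw;

    memcpy(&raw, p, sizeof(raw));       /* avoids unaligned access */
    return ntohs(raw);
}

/* Returns true if pkt holds an IPv4 / UDP packet to port 53 whose DNS
 * header has the QR bit cleared (i.e.: a query, not a response). */
static bool is_udp_dns_query(const uint8_t *pkt, size_t len)
{
    size_t ihl;
    const uint8_t *udp, *dns;

    if (len < 20 || (pkt[0] >> 4) != 4)         /* IPv4 only */
        return false;

    ihl = (size_t)(pkt[0] & 0x0f) * 4;          /* IP header length in bytes */
    if (ihl < 20 || len < ihl + 8 + 12)         /* + UDP header + DNS header */
        return false;
    if (pkt[9] != 17)                           /* protocol field: 17 = UDP */
        return false;

    udp = pkt + ihl;
    if (read_be16(udp + 2) != 53)               /* destination port */
        return false;

    dns = udp + 8;
    return (dns[2] & 0x80) == 0;                /* QR bit: 0 = query */
}
```

Comparing the raw two bytes against 53 without the conversion would fail on a little-endian host, since the port is stored as 0x00 0x35 on the wire.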

isc/labs/kernel.txt · Last modified: 2024/08/09 14:54 by florin.stancu
CC Attribution-Share Alike 3.0 Unported