Tasks
Before we get down to it, there are a few packages that need to be installed. The following commands should work on Ubuntu 20.04 Desktop. While most of these probably came with the base system, we'll enumerate them here for the sake of completeness. If you're using another distribution, most of these should have an equivalent package (probably with an identical name).
[student@host]$ sudo apt update
[student@host]$ sudo apt install curl make gcc git iptables dnsutils qemu-system debootstrap libxtables-dev
In case you're wondering what any of these are:
curl: CLI tool for fetching (HTTP) data from servers
make: project build automation tool
gcc: GNU C compiler
git: CLI tool for content tracking & versioning
iptables: user space interface to the kernel's packet filter
dnsutils: contains dig which we'll need to send out DNS queries
qemu-system: PC system emulator; if you don't have space for all of them, install only qemu-system-x86 or -arm
debootstrap: user space environment bootstrapping tool
libxtables-dev: development files for the packet filtering framework; needed for “xtables.h”
Moving forward, we prepared a code skeleton for you: skeleton.zip. Its structure is as follows:
[student@host]$ tree -L 2 skeleton
skeleton/                   --> root of workspace
├── 02/                     --> task 2: basically a "Hello World"
│   ├── Kbuild
│   ├── Makefile
│   ├── my_first_module.c
│   └── patches/
├── 03/                     --> task 3: an iptables extension
│   ├── include/
│   ├── module/
│   └── plugin/
└── images/                 --> task 1: VM hard disk images (empty)

7 directories, 3 files
At this time, there's no point in taking a more detailed look at each file. Instead, let us focus on the task at hand and prepare our testing environment.
When developing new features for the kernel, chances are that you will screw up. Often. Depending on the severity, the kernel may or may not recover. So to avoid restarting your PC over and over, it's better to work in a minimal virtualized environment. As such, we will first bootstrap a loopback disk image with a basic Ubuntu system, but without the kernel. Eventually, we will boot a virtual machine with qemu-system-x86_64 from this disk image, with a custom kernel that we will build ourselves.
For the bootstrapping process, we will use debootstrap. This tool will download a Debian-based ecosystem and install it in whatever directory we tell it to. Incidentally, that directory will be the mount point of the disk image that we are going to create.
# create a 5GB empty file -- this will be our disk image
[student@host]$ qemu-img create images/ubuntu.raw 5G

# build an ext4 filesystem onto the disk image -- now we can mount it
[student@host]$ mkfs.ext4 images/ubuntu.raw

# mount the ext4 filesystem -- now we can copy files onto it
[student@host]$ sudo mount images/ubuntu.raw /mnt

# bootstrap the Ubuntu system
[student@host]$ sudo debootstrap --arch amd64 focal /mnt http://archive.ubuntu.com/ubuntu
Almost there… if you list the contents of /mnt/, you will see most of the usual entries from your root directory. At this point, we should be able to boot into this machine (if we had a kernel image), but we wouldn't be able to log in. The only thing that's left for us to do is set a password for the root user. For this, we need to trick the passwd tool into thinking that /mnt/ is in fact the root of our filesystem. So we use chroot:
# pretend that /mnt/ is our new / and start a bash instance inside
[student@host]$ sudo chroot /mnt /bin/bash

# change password for current user (root)
[root@jail]$ passwd
New password: root
Retype new password: root
passwd: password updated successfully

# while we're here, set Google DNS as primary DNS
# qemu has a bug where it refuses to fall back to other resolvers
# so it will be hard stuck on 127.0.0.53 ==> can't resolve domain names
[root@jail]$ echo 'nameserver 8.8.8.8' > /etc/resolv.conf

# exit from this bash instance and escape from the chroot jail
[root@jail]$ exit

# finally, unmount our disk -- we're done with it for now
[student@host]$ sudo umount /mnt
Next step is to get the kernel source code and compile it. By separating the kernel from the disk image, we are able to check out other branches / commits and test different versions without installing them anywhere. Normally, you would have to select what options you want included in the compilation process (e.g.: memory allocators, cryptographic systems, etc.) by running make menuconfig. After finishing your selection and saving the configuration, a .config file would be created. Because we haven't the faintest idea what most of the things enumerated in that menu even are, we will rely on default configurations. One thing that is not part of the default configuration and will be useful later is debug symbols. Go through the following commands and refer to the GIF below when you are required to navigate the config menu.
# clone the linux kernel locally
[student@host]$ git clone --depth=1 https://github.com/torvalds/linux.git

# go into the repo directory
[student@host]$ pushd linux

# create a default configuration file (.config)
[student@host]$ make x86_64_defconfig kvm_guest.config

# manually enable debug symbols on top of current .config
# NOTE: refer to the GIF below
[student@host]$ make menuconfig

# optional: check out the generated .config file
[student@host]$ less .config

# compile the kernel using all cores
[student@host]$ make -j $(nproc)

# return to the previous directory
[student@host]$ popd
We are finally here. Let's boot up the VM from our bootstrapped disk image, with our personally compiled Linux kernel.
[student@host]$ sudo qemu-system-x86_64 \
    -m 4G \
    -smp 1 \
    -enable-kvm \
    -kernel linux/arch/x86/boot/bzImage \
    -drive file=images/ubuntu.raw,format=raw,index=0 \
    -append 'root=/dev/sda rw console=ttyS0 nokaslr' \
    -nographic
Let us have a look at this command, line by line:
-m 4G: allocate 4GB of memory (change this as you wish)
-smp 1: use only 1 vCPU; this is recommended for debugging purposes
-enable-kvm: KVM is a Linux kernel module that transforms your operating system into a bare-metal hypervisor. This is what allows you to run actual virtual machines on Linux. Without it, qemu would try to emulate the system, resulting in worse performance.
-kernel …/bzImage: this specifies the compiled & compressed kernel image to use when booting the virtual machine
-drive …: specifies the disk image to load; note that index=0 will make the VM consider this to be /dev/sda. Adding another drive with index=1 will cause it to be regarded as /dev/sdb.
-append …: these are command line arguments for the kernel (yes, even it has those). root=/dev/sda rw marks /dev/sda (i.e.: our ubuntu.raw disk image) as the root device to be mounted onto the root directory (i.e.: /) in read-write mode. console=ttyS0 exposes a UART serial interface to the VM and tells Linux to use it for I/O. nokaslr tells the kernel to disable kernel address space layout randomization.
-nographic: tells qemu not to open a separate window for the GUI. Instead, it will take the virtual serial device (which the VM will recognize as ttyS0) and link it to the terminal. So whatever the VM sends via the serial to be printed will end up in your stdout. Whatever you type into stdin will be forwarded to the VM as input.
Under normal circumstances, exit the VM by running poweroff.
After starting the VM and logging in as root (with the password that was set earlier), try finding out the kernel version in both the host and guest operating systems:
# host has the latest Arch Linux kernel (you may have Ubuntu, etc.)
[student@host]$ uname -r
5.15.2-arch1-1

# guest has the newest Linux kernel release candidate
[root@guest]$ uname -r
5.16.0-rc2+
The correct way of doing things would be creating a bridge (i.e.: a software layer-2 switch) with brctl, adding a network device to your VM via the -netdev flag, and attaching it to the newly created bridge. This is a bit overkill for our purpose today. If you ever need to create such a setup, there are plenty of resources available.
Although we said that you should have network access in your VM, there may be a chance that you don't have an IP address assigned. You may need to do this manually:
# list available interfaces (in a colorful fashion)
[root@guest]$ ip -c addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0

# send a DHCP request on your Ethernet interface
[root@guest]$ dhclient enp0s3

# check if an IP address was allocated
[root@guest]$ ip -c addr show enp0s3
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 86313sec preferred_lft 86313sec
    inet6 fec0::5054:ff:fe12:3456/64 scope site dynamic mngtmpaddr
       valid_lft 86318sec preferred_lft 14318sec
    inet6 fe80::5054:ff:fe12:3456/64 scope link
       valid_lft forever preferred_lft forever
Way back when, kernels used to be monolithic, meaning that adding new functionality required recompiling and installing it, followed by a reboot. Today, things are much easier. By using the kmod tools (man 8 kmod), users can load and unload modules (i.e.: kernel object files) on demand, without all the fuss. These modules are C programs that must implement initialization and removal functions that are called automatically. Usually, these functions register / unregister other functions contained in your object with core kernel systems.
We can use lsmod to get a list of all present modules, and modinfo to obtain detailed information about a specific module.
[student@host]$ lsmod
ecdh_generic           16384  1 bluetooth
[student@host]$ modinfo ecdh_generic | grep description
description:    ECDH generic algorithm
[student@host]$ modinfo bluetooth | grep description
description:    Bluetooth Core ver 2.22
What we can understand from this is that the Elliptic Curve Diffie-Hellman module is 16384 bytes in size and is used by one other module, bluetooth, through its ECDH helper. As you probably noticed, elixir.bootlin.com is a critical resource in navigating the kernel code.
Looking in the skeleton/02/ directory from our code skeleton, we will find a minimal build environment for our first module. Alas, compiling a kernel module differs from compiling a user space program. But just slightly: kernel-specific headers must be used, user space-specific libraries (e.g.: libc) are generally unavailable (so no printf()) and lastly, the same options that were used to compile the kernel itself must be specified. To this end, the kbuild system was introduced. As you can see, our Makefile invokes its counterpart from the kernel source directory in /lib/modules/..., which in turn uses the configuration in our Kbuild file. The obj-m variable specifies the name of the final output object file (in this case, test.o). test-objs contains the list of object files that make up test.o, so if you split your code across multiple sources, just add them to test-objs. If you have a single source, you can drop test-objs but the kbuild system will then expect a test.c file to be present.
Now, let's compile our module, load it into the kernel, and see what happens:
[student@host]$ make
[student@host]$ sudo insmod test.ko
[student@host]$ sudo dmesg
...
[ 6348.461247] my-first-module: Hello world!
[student@host]$ sudo rmmod test
[student@host]$ sudo dmesg
...
[ 6348.461247] my-first-module: Hello world!
[ 6366.635090] my-first-module: Goodbye cruel, cruel world!
Here, we used insmod to load a .ko kernel object file into the kernel proper and rmmod to remove it. dmesg is a tool that prints the kernel message buffer. Note that there are multiple log levels ranging from debug to emergency. pr_info() is the kernel's printf() variant that corresponds to one of the less urgent levels. dmesg can be configured to squelch messages under a certain level but, depending on how your kernel was compiled, some of the more important messages will also be echoed to your terminal.
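The relationship between log levels and what ends up on your terminal can be sketched in plain user space C. The numbering below mirrors the kernel's KERN_EMERG (0) through KERN_DEBUG (7) levels; the enum and the reaches_console() helper are names made up for this illustration, and the threshold parameter stands in for the console log level (the value that dmesg -n adjusts).

```c
#include <assert.h>
#include <stdbool.h>

/* Same numbering as the kernel's KERN_* levels: lower value = more urgent. */
enum klog_level {
    KLOG_EMERG = 0, KLOG_ALERT, KLOG_CRIT, KLOG_ERR,
    KLOG_WARNING, KLOG_NOTICE, KLOG_INFO, KLOG_DEBUG
};

/*
 * A message is echoed to the console only if it is more urgent
 * (numerically lower) than the console log level threshold.
 * It always lands in the message buffer that dmesg reads.
 */
static bool reaches_console(enum klog_level msg, int console_loglevel)
{
    assert(console_loglevel >= 0);
    return (int)msg < console_loglevel;
}
```

With a threshold of 4, for example, errors would still show up on the terminal while pr_info() messages (level 6) would only be visible through dmesg.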
In this task we are going to add a bug to our initial module. We will do this by applying a patch to our source:
[student@host]$ patch my_first_module.c patches/add_bug.patch
Now, our module has a 50% chance to dereference a NULL pointer every time we try to load it. If this happens, a kernel oops will occur. While no error is truly harmless, an oops is more so than a kernel panic. The difference between the two is that the system can recover from a kernel oops, but not from a kernel panic. The Windows equivalent of a kernel panic would be a Blue Screen Of Death.
Knowing that our module will cause trouble, we should test it inside the VM. In order to do this, we need to recompile it using the Makefile in the Linux repo that we cloned. For this, we overwrite the KDIR variable used in our module's Makefile.
# clean up previously created objects
[student@host]$ make clean

# recompile the module, but for the kernel used in the VM; not your live kernel
[student@host]$ KDIR=$(realpath ../linux/) make
Now, we need to get test.ko onto the VM. First of all, if it's still running, shut it down. Next, we are going to once again mount the disk image and copy the kernel object into root's home directory. Doing so on a live partition might be a bit trickier :p
# stop the VM if it's still running
[root@guest]$ poweroff

# once again, mount the VM disk image
[student@host]$ sudo mount ../images/ubuntu.raw /mnt

# copy the module in the VM's root home
[student@host]$ sudo cp test.ko /mnt/root

# unmount the disk before starting the VM again
[student@host]$ sudo umount /mnt
Finally, start up qemu once again and notice that test.ko is in /root/. Try to load it with insmod until you get an error like this:
This info dump may be intimidating at first sight, but it contains all the necessary information to identify the problem:
BUG: kernel NULL pointer dereference, address: 0000000000000000: the reason behind the oops.
#PF: supervisor write access in kernel mode: when dereferencing the virtual address 0x00, the MMU tried to find the corresponding physical page address, but failed. Remember that #PF stands for Page Fault.
RIP: 0010:init+0x3f/0x70 [test]: the faulting instruction was located in the test module, at an offset of 0x3f from the start of the init() function, which has a total size of 0x70 bytes.
Based on this information (especially the last part), we have a few ways of identifying the exact line of code and instruction where the module crashed. First up is addr2line. This tool can convert an address to a source file line number, given that the binary was compiled with debug symbols. We already know that the instruction was located at an offset of 0x3f from the init() function, but where was this function located relative to the beginning of the object? This can be easily discovered by consulting its symbol table with readelf.
# where is init() located relative to the start of the object file?
[student@host]$ readelf --symbols test.ko
   Num:    Value          Size Type    Bind   Vis      Ndx Name
   ...
    24: 0000000000000000   102 FUNC    LOCAL  DEFAULT    1 init
   ...

# apparently right at the very start ==> our instruction is at address 0x00 + 0x3f = 0x3f
# what line from what source file generated the instruction at address 0x3f?
[student@host]$ addr2line --exe test.ko 0x3f
/.../my_first_module.c:26
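The arithmetic in the comments above (symbol value from readelf plus offset from the RIP line) can be sketched as a small user space helper. This is only an illustration: the oops_loc structure and both function names are invented here, and the parser assumes the exact symbol+0xOFFSET/0xSIZE shape seen in the oops.

```c
#include <assert.h>
#include <stdio.h>

/* Decoded form of an oops location such as "init+0x3f/0x70". */
struct oops_loc {
    char symbol[64];        /* function name */
    unsigned long offset;   /* offset of the faulting instruction */
    unsigned long size;     /* total size of the function */
};

/* Split "symbol+0xOFF/0xSIZE" into its parts; returns 0 on success, -1 otherwise. */
static int parse_oops_loc(const char *text, struct oops_loc *loc)
{
    assert(text != NULL && loc != NULL);
    if (sscanf(text, "%63[^+]+%lx/%lx",
               loc->symbol, &loc->offset, &loc->size) != 3)
        return -1;
    return 0;
}

/* Address to hand to addr2line: symbol value (from readelf) plus the offset. */
static unsigned long object_address(unsigned long symbol_value,
                                    const struct oops_loc *loc)
{
    return symbol_value + loc->offset;
}
```

Feeding it "init+0x3f/0x70" and the symbol value 0x0 from the readelf output yields 0x3f, the same address we passed to addr2line by hand.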
Another way of identifying not only the source code line, but also the instruction, is by using a tool that may be familiar to you: objdump. This is a binary file disassembler. Next, we are going to disassemble (-d) only the .text section (a.k.a. the code section), displaying the instruction mnemonics in Intel syntax (-M intel) and interlacing the C code that generated these instructions (-S).
# looking for that elusive 3f offset...
[student@host]$ objdump -d -M intel -S test.ko
...
        /* we have a 50-50 chance to shoot ourselves in the foot */
        if (random & 0x80) {
  34:   80 7c 24 07 00          cmp    BYTE PTR [rsp+0x7],0x0
  39:   0f 89 00 00 00 00       jns    3f <init_module+0x3f>
                *((uint8_t *) NULL) = 0xff;
  3f:   c6 04 25 00 00 00 00    mov    BYTE PTR ds:0x0,0xff
  46:   ff
        } else {
...
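One detail in this listing may look odd: the C condition tests bit 7 (random & 0x80), yet the compiler emitted a comparison against zero followed by jns (jump if not sign). For a single byte the two tests are equivalent, since bit 7 is exactly the two's complement sign bit. Here is a small user space check of that claim; the helper names are invented for this sketch and are not part of the module.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The test as written in the C source. */
static bool bit7_is_set(uint8_t byte)
{
    return (byte & 0x80) != 0;
}

/* The test as the compiler encoded it: is the byte negative when read as signed? */
static bool sign_flag_would_be_set(uint8_t byte)
{
    int8_t as_signed;

    memcpy(&as_signed, &byte, sizeof(as_signed)); /* reinterpret the same bits */
    return as_signed < 0;
}

/* Aborts if the two tests disagree for any of the 256 possible byte values. */
static void check_tests_agree(void)
{
    for (int value = 0; value <= 0xff; value++)
        assert(bit7_is_set((uint8_t)value) ==
               sign_flag_would_be_set((uint8_t)value));
}
```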
# is our module still loaded?
[root@guest]$ lsmod | grep test
test                   16384  1

# can we remove the module?
[root@guest]$ rmmod test
rmmod: ERROR: Module test is in use

# looks like the module crashed while in the "Loading" state
# the kernel was trying to load it at address 0xffffffffc0304000
[root@guest]$ cat /proc/modules
test 20480 1 - Loading 0xffffffffc0304000 (O+)
In this case, the best course of action is to simply reboot, but if you want to risk it, just run rmmod -f to force unloading the module. Be warned that this is very dangerous. Also, if the module is still there after using the -f flag, make sure that CONFIG_MODULE_FORCE_UNLOAD was set at compile time.
# check if force unloading the module is an option
[root@guest]$ zcat /proc/config.gz | grep CONFIG_MODULE_FORCE_UNLOAD
CONFIG_MODULE_FORCE_UNLOAD=y

# force unload the kernel module
[root@guest]$ rmmod -f test
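If you ever want to check a module's state from a program rather than by eye, the /proc/modules line shown earlier can be split into its fields. The following is a minimal user space sketch; the proc_module structure and parse_proc_module() are names made up for this example, and the trailing taint flags (e.g. "(O+)") are simply ignored.

```c
#include <assert.h>
#include <stdio.h>

/* Fields of one /proc/modules line, in the order they are printed. */
struct proc_module {
    char name[64];
    unsigned long size;           /* bytes */
    int refcount;                 /* number of users */
    char deps[128];               /* modules using this one, "-" if none */
    char state[16];               /* Live, Loading or Unloading */
    unsigned long long address;   /* load address */
};

/* Parse one line; returns 0 on success, -1 if a field is missing. */
static int parse_proc_module(const char *line, struct proc_module *mod)
{
    assert(line != NULL && mod != NULL);
    int fields = sscanf(line, "%63s %lu %d %127s %15s %llx",
                        mod->name, &mod->size, &mod->refcount,
                        mod->deps, mod->state, &mod->address);
    return fields == 6 ? 0 : -1;
}
```

A module whose state field still reads "Loading" long after insmod returned is a good hint that its init function never finished, which is exactly our situation.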
# color schemes for man pages
man() {
    LESS_TERMCAP_mb=$'\e[1;34m'   \
    LESS_TERMCAP_md=$'\e[1;32m'   \
    LESS_TERMCAP_so=$'\e[1;33m'   \
    LESS_TERMCAP_us=$'\e[1;4;31m' \
    LESS_TERMCAP_me=$'\e[0m'      \
    LESS_TERMCAP_se=$'\e[0m'      \
    LESS_TERMCAP_ue=$'\e[0m'      \
    command man "$@"
}
Now, source your file to load the new command. man will color certain keywords appropriately.
# update your shell's environment with the man() wrapper
[student@host]$ source ~/.bashrc

# check out the manual page for iptables
[student@host]$ man iptables
iptables is a configuration tool for the kernel packet filter.
The system as a whole provides many functionalities that are grouped by tables: filter, nat, mangle, raw, security. If you want to alter a packet header, you place a rule in the mangle table. If you want to mask the private IP address of an internal host with the external IP address of the default gateway, you place a rule in the nat table. Depending on the table you choose, you will gain or lose access to some chains. If not specified, the default is the filter table.
Chains are basically lists of rules. The five built-in chains are PREROUTING, FORWARD, POSTROUTING, INPUT, OUTPUT. Each of these corresponds to certain locations in the network stack where packets trigger Netfilter hooks (here is the PREROUTING kernel hook as an example – not that hard to add one, right?) For a selected chain, the order in which the rules are evaluated is determined primarily by the priority of their tables and secondarily by the user's discretionary arrangement (i.e.: order in which rules are inserted).
A rule consists of two entities: a sequence of match criteria and a jump target.
The jump target represents an action to be taken. You are most likely familiar with the built-in actions such as ACCEPT or DROP. These actions decide the ultimate fate of the packet and are final (i.e.: rule iteration stops when these are invoked). However, there are also extended actions (see man 8 iptables-extensions) that are not terminal verdicts and can be used for various tasks such as auditing, forced checksum recalculation or removal of Explicit Congestion Notification (ECN) bits.
The match criteria of every rule are checked to determine if the jump target is applied. The way this is designed is very elegant: every type of feature (e.g.: layer 3 IP address vs layer 4 port) that you can check has a match callback function defined in the kernel. If you want, you can write your own such function in a Linux Kernel Module (LKM) and thus extend the functionality of iptables (Writing Netfilter Modules with code example). However, you will need to implement a userspace shared library counterpart. When you start an iptables process, it searches in /usr/lib/xtables/ and automatically loads certain shared libraries (note: this path can be overwritten or extended using the XTABLES_LIBDIR environment variable). Each library there must do three things:
… iptables --help is called (its help message is an amalgamation of each library's help snippet).
So when you want to test the efficiency of the iptables rule evaluation process, keep in mind that each rule may imply the invocation of multiple callbacks such as this.
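To make the evaluation model described above more concrete (a chain is an ordered list of rules, a rule is a set of match callbacks plus a target, and a terminal target stops the traversal), here is a toy user space model. None of these types are the kernel's own; struct packet, struct match, struct rule and run_chain() are simplifications invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum verdict { VERDICT_ACCEPT, VERDICT_DROP, VERDICT_CONTINUE };

/* Toy packet: just enough fields for the two matchers below. */
struct packet {
    unsigned int dst_ip;
    unsigned short dst_port;
};

/* A match callback inspects the packet against rule-specific data. */
typedef bool (*match_fn)(const struct packet *pkt, const void *matchinfo);

struct match {
    match_fn fn;
    const void *matchinfo;
};

struct rule {
    const struct match *matches;   /* all of them must succeed */
    size_t n_matches;
    enum verdict target;           /* VERDICT_CONTINUE = non-terminal target */
};

static bool match_dst_ip(const struct packet *pkt, const void *matchinfo)
{
    return pkt->dst_ip == *(const unsigned int *)matchinfo;
}

static bool match_dst_port(const struct packet *pkt, const void *matchinfo)
{
    return pkt->dst_port == *(const unsigned short *)matchinfo;
}

/* Walk the chain in order; the first terminal verdict ends the traversal. */
static enum verdict run_chain(const struct rule *chain, size_t n_rules,
                              const struct packet *pkt, enum verdict policy)
{
    assert(pkt != NULL && (chain != NULL || n_rules == 0));
    for (size_t i = 0; i < n_rules; i++) {
        bool all_matched = true;

        /* one callback invocation per match criterion, until one fails */
        for (size_t j = 0; j < chain[i].n_matches && all_matched; j++)
            all_matched = chain[i].matches[j].fn(pkt, chain[i].matches[j].matchinfo);

        if (all_matched && chain[i].target != VERDICT_CONTINUE)
            return chain[i].target;
    }
    return policy;   /* no terminal verdict was reached */
}
```

Note how a packet that traverses N rules with M criteria each can trigger up to N×M callback invocations, which is the cost the paragraph above warns about.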
Before writing our own match module, here's a small task to refresh your memory on how to use iptables.
Write an iptables rule according to the following specifications:
How to test:
$ sudo curl www.google.com
$ sudo dmesg
multiport, owner modules
$ man 8 iptables-extensions
xtables is the backbone of iptables and provides a protocol-agnostic infrastructure for adding match modules. We touched on this topic earlier in the exercise but now we're going to jump right in. Following this brief introduction will be three subsections detailing the data structures, the user space shared library and the kernel module. All these are partially implemented in the 03/ directory and you will have specific TODOs in the source files. By the end, you will have implemented a match filter for requested domains in DNS queries (i.e.: you can block DNS queries for “google.com” but not “kernel.org”, for example). Note that you can solve this task either in the VM, or on your localhost. We won't be doing anything dangerous, like overwriting kernel structures.
We mentioned before that an iptables extension has two components: a kernel module implementing the verification of a rule and a user space library that is able to parse the user's rules (in iptables “syntax”). Considering that our module will be named xt_dns_name (the xt_ part conforming to the naming convention), what links these two elements is the include/xt_dns_name.h header. In it, the xt_dns_name_mtinfo structure will hold all information necessary to the kernel module to match a packet. This structure will be initialized in user space and transferred over to the xtables framework by iptables. Looking closer, we notice two fields. While name will hold the queried domain name, flags specifies what features are enabled for verification. Our module is simple and has only one match criterion, but note that iptables can invert a selection by specifying the ! symbol before it. So we can match a packet either on a queried domain name match, or a mismatch.
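To illustrate how the two fields could work together, here is a user space sketch. It is not the contents of include/xt_dns_name.h: the structure name, field sizes and flag bits below are all hypothetical stand-ins chosen for the example, and the name is compared as a plain string for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits; the real header may define them differently. */
#define SKETCH_DNS_NAME      (1u << 0)   /* a domain name was supplied */
#define SKETCH_DNS_NAME_INV  (1u << 1)   /* the rule was written with '!' */

/* Stand-in for xt_dns_name_mtinfo: filled in user space, read by the kernel. */
struct sketch_mtinfo {
    uint8_t name[256];   /* queried domain name to compare against */
    uint8_t flags;       /* which checks are enabled */
};

/* Outcome of the match: name equality, flipped if '!' was given. */
static bool sketch_match(const struct sketch_mtinfo *info, const char *queried)
{
    assert(info != NULL && queried != NULL);
    bool equal = strcmp((const char *)info->name, queried) == 0;
    bool invert = (info->flags & SKETCH_DNS_NAME_INV) != 0;

    return equal != invert;   /* logical XOR */
}
```

The XOR at the end is the whole trick behind inversion: the comparison itself stays the same, and only its result is flipped when the inversion flag is set.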
The plugin consists of a shared library compiled from plugin/libxt_dns_name.c. The source is broken down into four sections:
API: These are prototypes of functions that must be made available to iptables to invoke when it needs certain things done. Here is a description of each function:
    dns_name_mt_help(): When you invoke iptables --help, each plugin will print its own help message. This is our contribution.
    dns_name_mt_init(): Called before the argument parsing begins. Will zero out our xt_dns_name_mtinfo structure.
    dns_name_mt_parse(): Based on optarg, this function will be invoked for each argument that has something to do with our plugin. It will update the xt_dns_name_mtinfo structure accordingly on each pass.
    dns_name_mt_check(): Final check before sending the structure to kernel space. Will verify that all required arguments were provided.
    dns_name_mt_print(): When invoking iptables -L, this function will print out a rule's match criteria.
    dns_name_mt_save(): Given a xt_dns_name_mtinfo structure, this function will print out the CLI arguments that would generate such a structure. Called upon by iptables-save.
MODULE SPECIFICATION STRUCTURES: Here we declare two structures:
    dns_name_mt_opts: A structure defining the CLI arguments that we accept (see man 3 getopt for details).
    dns_name_mt_reg: A structure containing function pointers for specific tasks that the module needs to accomplish. We initialize it with the functions described in the API section. This kind of structure is usually called a virtual table (or vtable).
IPTABLES MODULE CALLBACKS: This is where we actually implement the functions that we added to the vtable. Here, you will search for TODOs.
LIBRARY MANAGEMENT FUNCTIONS: When loaded, each library (i.e.: a .so file) can have a number of constructors defined. These constructors are functions called upon by the loader once the library has been mapped in virtual memory. Our constructor will invoke xtables_register_match() in order to register the vtable with iptables, letting it know that it has yet another plugin at its disposal.
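The vtable-plus-constructor pattern described in the last two sections can be tried out in isolation. The sketch below does not use the xtables API at all: struct toy_plugin and toy_register() are invented stand-ins for the real match structure and xtables_register_match(), and the constructor attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy vtable: the host program only knows these function pointers. */
struct toy_plugin {
    const char *name;
    void (*help)(void);
    int  (*parse)(const char *arg);
};

/* Stand-in for the host's registry of known plugins. */
static const struct toy_plugin *registered[8];
static size_t n_registered;

static void toy_register(const struct toy_plugin *plugin)
{
    assert(n_registered < sizeof(registered) / sizeof(registered[0]));
    registered[n_registered++] = plugin;
}

static void toy_help(void)
{
    puts("toy options: --domain <name>");
}

static int toy_parse(const char *arg)
{
    return arg != NULL && arg[0] != '\0';   /* accept any non-empty argument */
}

/* The vtable itself, initialized with the callbacks defined above. */
static const struct toy_plugin toy_reg = {
    .name  = "toy",
    .help  = toy_help,
    .parse = toy_parse,
};

/* Runs when the object is loaded, i.e. before main() gets control. */
__attribute__((constructor))
static void toy_ctor(void)
{
    toy_register(&toy_reg);
}
```

By the time main() starts, the registry already holds one entry, and the host can call registered[0]->help() without knowing anything else about the plugin.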
The kernel module source is organized similarly to the user space plugin. The dns_name_mt_reg structure acts as a vtable but also includes information about permissible chains and layer 3 protocols that work with our implementation. Specifically, any rule that makes use of this module can be inserted only in the OUTPUT chain, meaning that we can only catch requests originating from our localhost. Moreover, we implement support only for IPv4, not for IPv6. As we can see, this structure is used on module initialization, in dns_name_mt_init(), to register our module with the xtables framework via xt_register_match().
dns_name_check() and dns_name_mt() implement the functionalities required of our module. The former performs checks on each newly inserted rule, or at least on the part that pertains to this module. In other words, it must make sure that a valid domain name (i.e.: each "." replaced with the length of the following label, etc.) was inserted, for example. The latter function is called upon to verify if a packet matches a certain rule. Its first argument does not represent the packet itself, but a socket buffer structure (struct sk_buff) that contains this information, in addition to much, much more. We made sure to provide you with pointers to our xt_dns_name_mtinfo structure, but also to the beginning of the IPv4 header. However, it is up to you to implement this logic and obtain a working match module.
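The domain name format mentioned above (every "." replaced by the length of the label that follows, with a zero byte at the end) is the one used on the wire in DNS queries. Here is a user space sketch of the conversion; dns_encode_name() is a helper invented for this example and is not part of the skeleton, so treat it as one possible approach rather than the expected solution.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert "fep.grid.pub.ro" into DNS wire format:
 *   [3]fep[4]grid[3]pub[2]ro[0]
 * where [n] is a single byte holding the length of the next label.
 * Returns the encoded length, or -1 on an invalid name or a short buffer.
 */
static int dns_encode_name(const char *dotted, unsigned char *out, size_t out_len)
{
    size_t pos = 0;

    assert(dotted != NULL && out != NULL);
    while (*dotted != '\0') {
        const char *dot = strchr(dotted, '.');
        size_t label_len = dot ? (size_t)(dot - dotted) : strlen(dotted);

        /* labels are 1..63 bytes; keep room for the terminating zero */
        if (label_len == 0 || label_len > 63 ||
            pos + 1 + label_len + 1 > out_len)
            return -1;

        out[pos++] = (unsigned char)label_len;
        memcpy(out + pos, dotted, label_len);
        pos += label_len;
        dotted += label_len;
        if (*dotted == '.')
            dotted++;               /* skip the separator */
    }
    if (pos + 1 > out_len)
        return -1;
    out[pos++] = 0;                 /* root label terminates the name */
    return (int)pos;
}
```

Encoding the rule's name once (in user space or in the check function) means the per-packet match can be a simple byte comparison against the question section of the query.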
[student@host]$ sudo insmod xt_dns_name.ko

# depending on your distro, libxt_*.so may be installed in different places
[student@host]$ sudo XTABLES_LIBDIR="$(pwd):/usr/lib/xtables:/usr/lib/x86_64-linux-gnu/xtables" \
    iptables \
    -m dns_name \
    -I OUTPUT \
    --domain 'fep.grid.pub.ro' \
    -j DROP

[student@host]$ dig +short fep.grid.pub.ro @8.8.8.8

[student@host]$ sudo iptables -F OUTPUT
[student@host]$ sudo rmmod xt_dns_name
Remember that although your host is most likely little-endian, the Internet is big-endian. So when accessing data that is larger than 1 byte (e.g.: port number), use the htons() family of functions. They should be readily available to you in the kernel module.
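As a quick illustration of the byte order issue, here is how the destination port could be read out of raw UDP header bytes in user space. The sketch_udphdr structure and udp_dest_port() are made up for this example (the kernel has its own struct udphdr), and ntohs() comes from <arpa/inet.h> here rather than from the kernel headers.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal UDP header layout; all fields travel in network (big-endian) order. */
struct sketch_udphdr {
    uint16_t source;
    uint16_t dest;
    uint16_t len;
    uint16_t check;
};

/* Read the destination port from raw header bytes and convert it to host order. */
static uint16_t udp_dest_port(const unsigned char *raw)
{
    struct sketch_udphdr hdr;

    assert(raw != NULL);
    memcpy(&hdr, raw, sizeof(hdr));   /* avoids unaligned access */
    return ntohs(hdr.dest);
}
```

On a little-endian host, comparing hdr.dest directly against 53 would fail, because the two bytes 0x00 0x35 read as 0x3500; either convert the field with ntohs() or compare it against htons(53).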