“Everything is a file” is a famous Linux philosophy, and there is a good reason for it: the Linux operating system handles most of its devices the same way it handles files, through the same open/read/write/close interface.
# /usr/bin/time -v date
Once memory pages are mapped into the buffer cache, the kernel will attempt to use these pages resulting in a Minor Page Fault (MnPF). A MnPF saves the kernel time by reusing a page in memory as opposed to placing it back on the disk.
To find out how many Major Page Faults (MPF) and Minor Page Faults (MnPF) occur when an application starts, the time command can be used:
# /usr/bin/time -v evolution
Each time an application issues an I/O, it takes an average of 8 ms to service that I/O on a 10K RPM disk. Since this is a fixed cost, it is imperative that the disk be as efficient as possible with the time it spends reading and writing. The number of I/O requests is often measured in I/Os Per Second (IOPS). A 10K RPM disk can push 120 to 150 (burst) IOPS. To measure the effectiveness of IOPS, divide the number of IOPS by the amount of data read or written for each I/O.
Sequential I/O - The iostat command provides information on IOPS and the amount of data processed during each I/O. Use the -x switch with iostat (iostat -x 1). Sequential workloads read large amounts of data sequentially and at once; examples include enterprise databases executing large queries and streaming media services capturing data. With sequential workloads, the KB-per-I/O ratio should be high. Sequential workload performance relies on the ability to move large amounts of data as fast as possible: if each I/O costs time, it is imperative to get as much data out of that I/O as possible.
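The KB-per-I/O ratio can be computed directly from the rkB/s and r/s columns that iostat -x reports. A quick sketch with made-up sample values (not taken from a real run):

```shell
# Hypothetical iostat -x readings (illustrative values only):
rkbs=4000   # rkB/s: kilobytes read per second
rps=125     # r/s: read requests per second
echo "$((rkbs / rps)) KB per I/O"   # prints "32 KB per I/O"
```

A large ratio like this suggests a sequential-leaning workload; a ratio of a few KB per I/O suggests a random-access one.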
Random I/O - Random access workloads do not depend as much on the size of the data; they depend primarily on the number of IOPS a disk can push. Web and mail servers are examples of random access workloads: the I/O requests are rather small, and performance relies on how many requests can be processed at once. Therefore, the number of IOPS the disk can push becomes crucial.
The following vmstat output demonstrates a system under memory distress. It is writing data out to the swap device:
To see the effect that swapping to disk has on the system, check the swap partition on the drive using iostat.
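For example, assuming the swap partition is /dev/sda3 (the device name here is an assumption; run swapon -s to find yours), it could be watched with:

```shell
# Print extended statistics for the swap device once per second.
# High w/s and %util on this device confirm that swapping is the load.
iostat -x 1 /dev/sda3
```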
average IOPS = 1000 / (average rotational delay in ms + average seek time in ms + average latency in ms)
Let's calculate the Rotational Delay (RD) and the resulting IOPS for a 10K RPM drive, assuming an average seek time of 3 ms and 2 ms of latency:

10000 RPM / 60 = 166 rotations per second
1 / 166 = 0.006 seconds (6 ms) per full rotation
6 ms / 2 = 3 ms average rotational delay (on average, half a rotation)
3 ms rotational delay + 3 ms seek time = 6 ms
6 ms + 2 ms latency = 8 ms average service time per I/O
1000 ms / 8 ms = 125 IOPS
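The same arithmetic can be reproduced with shell arithmetic, so you can plug in other drive parameters (the 3 ms seek time and 2 ms latency below are the assumed values from above):

```shell
rpm=10000
rps=$((rpm / 60))                     # 166 rotations per second
rotation_ms=$((1000 / rps))           # ~6 ms per full rotation
rot_delay_ms=$((rotation_ms / 2))     # 3 ms average rotational delay
service_ms=$((rot_delay_ms + 3 + 2))  # + 3 ms seek + 2 ms latency = 8 ms
echo "$((1000 / service_ms)) IOPS"    # prints "125 IOPS"
```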
Add to your archive the operations and the result you obtained (a screenshot, or a picture of the calculations made by hand on paper).
Calculate the Rotational Delay, and then the IOPS for a 5400 RPM drive.
$ iostat -xdm
Use iostat with -p for specific device statistics:
$ iostat -xdm -p sda
Add to your archive screenshots or pictures of the operations and the result you obtained, also showing the iostat output from which you took the values.
$ df -kh /dev/loop*
Install iotop on Debian/Ubuntu Linux:

$ sudo apt-get install iotop

Run the iotop command:

$ sudo iotop

or simply:

$ iotop
Options supported by the iotop command:

| Option | Description |
| --- | --- |
| --version | show program's version number and exit |
| -h, --help | show this help message and exit |
| -o, --only | only show processes or threads actually doing I/O |
| -b, --batch | non-interactive mode |
| -n NUM, --iter=NUM | number of iterations before ending [infinite] |
| -d SEC, --delay=SEC | delay between iterations [1 second] |
| -p PID, --pid=PID | processes/threads to monitor [all] |
| -u USER, --user=USER | users to monitor [all] |
| -P, --processes | only show processes, not all threads |
| -a, --accumulated | show accumulated I/O instead of bandwidth |
| -k, --kilobytes | use kilobytes instead of a human-friendly unit |
| -t, --time | add a timestamp on each line (implies --batch) |
| -q, --quiet | suppress some lines of header (implies --batch) |
Provide a screenshot showing iotop with only the active processes, one of them being the running script. Then provide another screenshot taken after you managed to kill it.
Linux allows you to use part of your RAM as a block device, viewing it as a hard disk partition. The advantage of using a RAM disk is the extremely low latency (even when compared to SSDs). The disadvantage is that all contents will be lost after a reboot.
Before getting started, let's find out the file system that our root partition uses. Run the following command (T - print file system type, h - human readable):
$ df -Th
The result should look like this:
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs  1.1G     0  1.1G   0% /dev
tmpfs          tmpfs     214M  3.8M  210M   2% /run
/dev/sda1      ext4      218G  4.1G  202G   2% /        <- root partition
tmpfs          tmpfs     1.1G  252K  1.1G   1% /dev/shm
tmpfs          tmpfs     5.0M  4.0K  5.0M   1% /run/lock
tmpfs          tmpfs     1.1G     0  1.1G   0% /sys/fs/cgroup
/dev/sda2      ext4      923M   73M  787M   9% /boot
/dev/sda4      ext4      266G   62M  253G   1% /home
From the results, note the file system type of each partition. For the ramdisk we will use tmpfs, which is its own file system; the device field in the mount command below is only a label, since tmpfs is not backed by a real device:

$ sudo mkdir /mnt/ramdisk
$ sudo mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk
To make the mount persistent across reboots, you can add the following line to /etc/fstab:

tmpfs /mnt/ramdisk tmpfs rw,nodev,nosuid,size=1G 0 0
That's it. We just created a 1 GB tmpfs ramdisk and mounted it at /mnt/ramdisk. Use df again to check this yourself.
As we mentioned before, you can't get I/O statistics for tmpfs since it is not a real partition. One solution to this problem is to use pv to monitor the progress of data transfer through a pipe. This is a valid approach only as long as disk I/O is the bottleneck.
Next, we will generate 512 MB of random data and place it first in /mnt/ramdisk/rand, then in /home/student/rand. The transfer is done using dd with 2048-byte blocks.
$ pv /dev/urandom | dd of=/mnt/ramdisk/rand bs=2048 count=$((512 * 1024 * 1024 / 2048))
$ pv /dev/urandom | dd of=/home/student/rand bs=2048 count=$((512 * 1024 * 1024 / 2048))
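As a sanity check, the count argument used above corresponds to 512 MB worth of 2048-byte blocks:

```shell
bytes=$((512 * 1024 * 1024))  # 512 MB expressed in bytes
count=$((bytes / 2048))       # number of 2048-byte blocks dd must copy
echo "$count"                 # prints "262144"
```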
Look at the elapsed time and average transfer speed. What conclusion can you draw?
Put one screenshot with the tmpfs partition in df output and one screenshot of both pv commands and write your conclusion.
The purpose of this exercise is to identify where bottlenecks appear in a real-world application. For this we will use perf and American Fuzzy Lop (AFL).
AFL is a fuzzing tool. Fuzzing is the process of detecting bugs empirically. Starting from a seed input file, a certain program is executed and its behavior observed. The meaning of “behavior” is not fixed, but in the simplest sense, let's say that it means “the order in which instructions are executed”. After executing the binary under test, the fuzzer mutates the input file. Following another execution, with the updated input, the fuzzer decides whether or not the mutations were useful. This determination is made based on deviations from known paths during runtime. Fuzzers usually run over a period of days, weeks, or even months, all in the hope of finding an input that crashes the program.
First, let's compile AFL and all related tools. We initialize / update a few environment variables to make them more accessible. Remember that these are set only for the current shell.
$ git clone https://github.com/google/AFL
$ pushd AFL
$ make -j $(nproc)
$ export PATH="${PATH}:$(pwd)"
$ export AFL_PATH="$(pwd)"
$ popd
Now, check that it worked:
$ afl-fuzz --help
$ afl-gcc --version
The program under test will be fuzzgoat, a vulnerable program made for the express purpose of illustrating fuzzer behaviour. To prepare the program for fuzzing, the source code has to be compiled with afl-gcc, a wrapper over gcc that statically instruments the compiled program. This instrumentation is leveraged by afl-fuzz to track which branches are taken during execution; in turn, that information is used to guide the input mutation procedure.
$ git clone https://github.com/fuzzstati0n/fuzzgoat.git
$ pushd fuzzgoat
$ CC=afl-gcc make
$ popd
If everything went well, we finally have our instrumented binary. Time to run afl. For this, we will use the sample seed file provided by fuzzgoat. Here is how we call afl-fuzz:
* the -i flag specifies the directory containing the initial seed
* the -o flag specifies the active workspace for the afl instance
* -- separates the afl flags from the binary invocation command
* everything after the -- separator is how the target binary would normally be invoked in bash; the only difference is that the input file name is replaced by @@
$ afl-fuzz -i fuzzgoat/in -o afl_output -- ./fuzzgoat/fuzzgoat @@
If you look in the afl_output/ directory, you will see a few files and directories created by afl-fuzz: queue/ holds the test cases afl generates, crashes/ and hangs/ hold the inputs that triggered crashes or hangs, and fuzzer_stats records statistics about the run. Each test case is substituted for the @@ marker in the program invocation.
Next, we will analyze the performance of afl. Using perf, we are able to specify one or more events (see man perf-list(1)) that the kernel knows to record only when our program under test (in this case afl) is running. When the internal event counter reaches a certain value (see the -c and -F flags in man perf-record(1)), a sample is taken. This sample can contain different kinds of information; for example, the -g option requests the inclusion of a backtrace of the program with every sample.
Let's record some stats using unhalted CPU cycles as an event trigger, every 1k events in userspace, and including frame pointers in samples:
$ perf record -e cycles -c 1000 -g --all-user \
    afl-fuzz -i fuzzgoat/in -o afl_output -- ./fuzzgoat/fuzzgoat @@
If perf complains about insufficient permissions, relax the perf_event_paranoid setting as root, then try again:

$ sudo su
# echo -1 > /proc/sys/kernel/perf_event_paranoid
# exit
Leave the process running for a minute or so, then kill it with <Ctrl + C>. perf will take a few moments longer to save all collected samples to a file named perf.data, which is read by perf script and perf report. Don't mess with it!
Let's see some raw trace output first, then look at the perf report. The report aggregates the raw trace information and identifies stress areas.
$ perf script -i perf.data
$ perf report -i perf.data
Use perf script to identify the PID of afl-fuzz (hint: -F). Then, filter out any samples unrelated to afl-fuzz (i.e., those of its child process, fuzzgoat) from the report. Finally, identify the most heavily used functions in afl-fuzz. Can you figure out what they do from the source code?
Make sure to include plenty of screenshots and explanations for this task :p
A Flame Graph is a graphical representation of the stack traces captured by the perf profiler during the execution of a program. It provides a visual depiction of the call stack, showing which functions were active and how much time was spent in each one of them. By analyzing the flame graph generated by perf, we can identify performance bottlenecks and pinpoint areas of the code that may need optimization or further investigation.
When analyzing flame graphs, it's crucial to focus on the width of each stack frame, as it directly indicates the number of recorded events following the same sequence of function calls. In contrast, the height of the frames does not carry significant implications for the analysis and should not be the primary focus during interpretation.
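One common way to turn perf.data into such an SVG is with Brendan Gregg's FlameGraph scripts (an external tool, not part of perf; cloning it into the current directory as below is an assumption):

```shell
# Fold the perf stack traces into one line per unique stack, then render the SVG.
git clone https://github.com/brendangregg/FlameGraph
perf script -i perf.data | ./FlameGraph/stackcollapse-perf.pl > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
```

Open flamegraph.svg in a browser; the SVG is interactive, so you can click a frame to zoom into that part of the call stack.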
Using the samples previously obtained in perf.data, generate a corresponding Flame Graph in SVG format and analyze it.
Please take a minute to fill in the feedback form for this lab.