# Lab 02 - Memory Monitoring (Linux)

## Objectives

• Offer an introduction to Virtual Memory.
• Get you acquainted with relevant commands and their outputs for monitoring memory-related aspects.
• Introduce the concept of page de-duplication.
• Present a step-by-step guide to Intel PIN for dynamic instrumentation.

## Introduction

### 01. Virtual Memory

Virtual memory uses a disk as an extension of RAM so that the effective size of usable memory grows correspondingly. The kernel will write the contents of a currently unused block of memory to the hard disk so that the memory can be used for another purpose. When the original contents are needed again, they are read back into memory. This is all made completely transparent to the user; programs running under Linux only see the larger amount of memory available and don't notice that parts of them reside on the disk from time to time. Of course, reading and writing the hard disk is slower (on the order of a thousand times slower) than using real memory, so the programs don't run as fast. The part of the hard disk that is used as virtual memory is called the swap space.

### 02. Virtual Memory Pages

Virtual memory is divided into pages. On the x86 architecture, each virtual memory page is 4 KB. When the kernel writes memory to and from disk, it writes memory in pages. The kernel writes memory pages to both the swap device and the file system.
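You can query the page size of the machine you are working on directly from Python (4096 bytes on x86; other architectures or huge-page setups may differ):

```python
import os

# SC_PAGE_SIZE is the POSIX sysconf name for the system page size
page = os.sysconf("SC_PAGE_SIZE")
print(page)   # typically 4096 on x86
```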

### 03. Kernel Memory Paging

Memory paging is a normal activity not to be confused with memory swapping. Memory paging is the process of syncing memory back to disk at regular intervals. Over time, applications will grow to consume all available memory. At some point, the kernel must scan memory and reclaim unused pages to be allocated to other applications.

### 04. The Page Frame Reclaim Algorithm (PFRA)

The PFRA is responsible for freeing memory. The PFRA selects which memory pages to free by page type. Page types are listed below:

• Unreclaimable – locked, kernel, reserved pages
• Swappable – anonymous memory pages
• Syncable – pages backed by a disk file

All but the “unreclaimable” pages may be reclaimed by the PFRA. There are two main functions in the PFRA. These include the kswapd kernel thread and the “Low On Memory Reclaiming” function.

### 05. Kswapd

The kswapd daemon is responsible for ensuring that memory stays free. It monitors the pages_high and pages_low watermarks in the kernel. If the amount of free memory is below pages_low, the kswapd process starts a scan to attempt to free 32 pages at a time. It repeats this process until the amount of free memory is above the pages_high watermark.
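The watermarks kswapd compares free memory against are exposed per zone in /proc/zoneinfo. The sketch below (Linux only; field names are as the kernel prints them) parses the min/low/high values, in pages:

```python
# Sketch: collect each zone's min/low/high watermarks from /proc/zoneinfo.
def zone_watermarks(path="/proc/zoneinfo"):
    zones, current = {}, None
    with open(path) as f:
        for line in f:
            tok = line.split()
            if line.startswith("Node"):            # e.g. "Node 0, zone   Normal"
                current = " ".join(tok)
                zones[current] = {}
            elif current and len(tok) == 2 and tok[0] in ("min", "low", "high"):
                zones[current][tok[0]] = int(tok[1])
    return zones

for zone, wm in zone_watermarks().items():
    print(zone, wm)
```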

The kswapd thread performs the following actions:

• If the page is unmodified, it places the page on the free list.
• If the page is modified and backed by a file system, it writes the contents of the page to disk.
• If the page is modified and not backed by any file system (anonymous), it writes the contents of the page to the swap device.
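The three cases above can be summarized as a small decision function (a conceptual sketch in plain Python, not kernel code):

```python
# Per-page decision kswapd makes when reclaiming, per the list above.
def reclaim_action(modified: bool, file_backed: bool) -> str:
    if not modified:
        return "place on free list"        # clean page: just free it
    if file_backed:
        return "write to backing file"     # dirty file-backed page
    return "write to swap device"          # dirty anonymous page

print(reclaim_action(True, False))         # dirty anonymous page goes to swap
```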

### 06. Kernel Paging with pdflush

• The pdflush daemon is responsible for synchronizing any pages associated with a file on a filesystem back to disk. In other words, when a file is modified in memory, the pdflush daemon writes it back to disk.
• The pdflush daemon starts synchronizing dirty pages back to the filesystem when 10% of the pages in memory are dirty. This is due to a kernel tuning parameter called vm.dirty_background_ratio.
• The pdflush daemon works independently of the PFRA under most circumstances. When the kernel invokes the LMR (Low on Memory Reclaiming) algorithm, the LMR specifically forces pdflush to flush dirty pages in addition to other page freeing routines.
• The vmstat utility reports on virtual memory usage in addition to CPU usage. The following fields in the vmstat output are relevant to virtual memory: swpd, free, buff, cache, si, so, bi, bo (use man vmstat to read their descriptions).
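The vm.dirty_background_ratio tunable mentioned above can be inspected without sysctl by reading procfs directly (Linux only):

```python
# The percentage of memory that may be dirty before background writeback starts.
with open("/proc/sys/vm/dirty_background_ratio") as f:
    ratio = int(f.read().strip())
print("dirty_background_ratio:", ratio)
```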

The following vmstat output demonstrates heavy utilization of virtual memory during an I/O application spike. The following observations can be made based on this output:

• A large number of disk blocks are paged in (bi) from the filesystem. This is evident in the fact that the page cache (cache) grows.
• During this period, the amount of free memory (free) remains steady at 17MB even though data is paging in from the disk to consume free RAM.
• To maintain the free list, kswapd steals memory from the read/write buffers (buff) and assigns it to the free list. This is evident in the gradual decrease of the buffer cache (buff).
• The kswapd process then writes dirty pages to the swap device (so). This is evident in the fact that the amount of virtual memory utilized gradually increases (swpd).

Conclusions:

• The fewer major page faults on a system, the better the response times, as the system is leveraging memory caches over disk-based caches.
• Low amounts of free memory are a good sign that caches are effectively used unless there are sustained writes to the swap device and disk.
• If a system reports any sustained activity on the swap device, it means there is a memory shortage on the system.

### 01. [10p] Memory usage

Open ex01.py and take a look at the code. What is the difference between the methods do_append() and do_allocate()?
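We do not reproduce ex01.py here; the following is a purely hypothetical sketch of what two such methods typically look like (the names match the task, the bodies are assumptions for illustration only):

```python
# Hypothetical shapes of the two methods -- not the actual ex01.py code.
def do_append(n):
    data = []
    for i in range(n):      # memory footprint grows gradually, one element at a time
        data.append(i)
    return data

def do_allocate(n):
    return [0] * n          # one large allocation up front
```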

Use vmstat to monitor the memory usage while performing the following experiments. In the main method, call:

• The do_append method on its own (see Experiment 1).
• The do_allocate method on its own (see Experiment 2).
• Both methods as shown in the Experiment 3 area in the code.
• Both methods as shown in the Experiment 4 area in the code.

Offer an interpretation for the obtained results.

Explain in a paragraph how Generational Garbage Collection works.
In a python shell, get the generational thresholds and the current number of live references in each generation.
Copy+paste the code and the resulting values or add a screenshot.
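As a starting point, the queries the task asks for look like this in a python shell (the printed values differ per session):

```python
import gc

print(gc.get_threshold())   # collection thresholds for generations 0/1/2
print(gc.get_count())       # objects currently tracked in each generation
```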

### 02. [20p] Swap space

Before starting this task, show your progress to the lab assistant. If you manage to freeze your PC, it might prove tricky to do so afterwards.

#### [10p] Task A - Swap File

First, let us check what swap devices we have enabled. Check the NAME and SIZE columns of the following command:

$ swapon --show

No output means that there are no swap devices available. If you ever installed a Linux distro, you may remember creating a separate swap partition. This, however, is only one method of creating swap space. The other is by adding a swap file. Run the following commands:

$ sudo swapoff -a
$ sudo dd if=/dev/zero of=/swapfile bs=1024 count=$((4 * 1024 * 1024))
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
$ swapon --show

Just to clarify what we did:

• disabled all swap devices
• created a 4 GB zero-initialized file
• set the file's permissions so that only root can read or write it
• created a swap area from the file using mkswap (works on devices too)
• activated the swap area

The new swap area is temporary and will not survive a reboot. To make it permanent, we need to register it in /etc/fstab by adding a line such as this:

/swapfile swap swap defaults 0 0

Now that we created a swap file, what are the advantages / disadvantages when compared to a swap partition?

• easier to manage
• similar performance
• can be affected by disk fragmentation (not the case for a partition)

#### [10p] Task B - Does it work?

In one terminal run vmstat and look at the swpd and free columns.

$ vmstat -w 1

In another terminal, open a python shell and allocate a bit more memory than the available RAM. Identify the moment when the newly created swap space is being used.

One thing you might notice is that the value in vmstat's free column is lower than before. This does not mean that you have less available RAM after creating the swap file. Remember using the dd command to create a 4 GB file? A big chunk of RAM was used to buffer the data that was written to disk. If free drops to unacceptable levels, the kernel will make sure to reclaim some of this buffer/cache memory. To get a clear view of how much available memory you actually have, try running the following command:

$ free -h
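The allocation in the python shell can be sketched as below. The chunk count is an assumption: raise it until the total exceeds your free RAM, and be careful, as this can freeze a machine with too little swap.

```python
# Allocate memory in fixed-size chunks while watching vmstat in the other
# terminal; stop once swpd starts growing.
chunks = []
for _ in range(4):
    chunks.append(bytearray(16 * 1024 * 1024))   # 16 MiB, zero-filled (pages committed)

total_mib = sum(len(c) for c in chunks) // (1024 * 1024)
print(total_mib, "MiB held")
```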

Observe that once you close the python shell and the memory is freed, swpd still displays a non-zero value. Why? There simply isn't a reason to clear the data from the swap area. If you really want to clean up the used swap space, try the following:

$ vmstat
$ sudo swapoff -a && sudo swapon -a
$ vmstat

#### NON-DEMO TASK

Create two swap files. Set their priorities to 10 and 20, respectively. Include the commands (copy+paste) or a screenshot of the terminal.

### 03. [30p] Kernel Samepage Merging

KSM is a page de-duplication strategy introduced in kernel version 2.6.32. In case you are wondering, it's not the same thing as the file page cache. KSM was originally developed in tandem with KVM in order to detect data pages with exactly the same content and make their page table entries point to the same physical address (marked Copy-On-Write). The end goal was to allow more VMs to run on the same host. Since each page must be scanned for identical content, this solution had no chance of scaling well with the available quantity of RAM. So, the developers compromised to scan only the private anonymous pages that were marked as likely candidates via madvise(addr, length, MADV_MERGEABLE).

Download the skeleton for this task.

#### [15p] Task A - Check kernel support & enable ksmd

First things first, you need to verify that KSM was enabled during your kernel's compilation. For this, you need to check the kernel build config file that is stored on your /boot partition. Hopefully, you should see something like this:

$ grep CONFIG_KSM /boot/config-$(uname -r)
CONFIG_KSM=y

If you don't have KSM enabled, you could recompile the kernel with the CONFIG_KSM flag and try it, but you don't have to :)

Moving forward. Next thing on the list is to check that the ksmd daemon is functioning. Any configuration that we'll do will be through the sysfs files in /sys/kernel/mm/ksm. Consequently, you should change user to root (even sudo should not allow you to write to these files).
• /…/run : this is 1 if the daemon is active; write 1 to it if it's not
• /…/pages_to_scan : this is how many pages will be scanned before going to sleep; you can increase this to 1000 if you want to see faster results
• /…/sleep_millisecs : this is how many ms the daemon sleeps in between scans; since you've modified pages_to_scan, you can leave this be
• /…/max_page_sharing : this is the maximum number of pages that can be de-duplicated; in cases like this it's better to go big or go home; so set it to something like 1000000, just to be sure

There are a few more files in the ksm/ directory. We will still use one or two later on. But for now, configuring the previous ones should be enough. Google the rest if you're interested.

#### [15p] Task B - Watch the magic happen

For this step it would be better to have a few terminals open. First, let's start a vmstat. Keep your eyes on the active memory column when we run the sample program.

$ vmstat -wa -S m 1
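The skeleton is not reproduced here, but a minimal sketch of what a KSM test program does looks like this: create private anonymous pages with identical content and mark them MADV_MERGEABLE so ksmd considers them for de-duplication. (The madvise call needs Python 3.8+ and a kernel built with CONFIG_KSM; it is wrapped in a try block for that reason.)

```python
import mmap

SIZE = 64 * 4096                                   # 64 pages
mm = mmap.mmap(-1, SIZE,
               flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
mm.write(b"A" * SIZE)                              # identical content on every page
try:
    mm.madvise(mmap.MADV_MERGEABLE)                # mark pages as KSM candidates
    print("madvise(MADV_MERGEABLE) succeeded")
except (AttributeError, OSError) as e:
    print("madvise not available:", e)
```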

Next would be a good time to introduce two more files from the ksm/ sysfs directory:

• /…/pages_shared : this file reports how many physical pages are in use at the moment
• /…/pages_sharing : this file reports how many virtual page table entries point to the aforementioned physical pages

For this experiment we will also want to monitor the number of de-duplicated virtual pages, so have at it:

$ watch -n 1 cat /sys/kernel/mm/ksm/pages_sharing

#### [10p] Task B - Instrumentation Callbacks

Looking at main(), most Pin API calls are self explanatory. The only one that we're interested in is the following:

INS_AddInstrumentFunction(ins_instrum, NULL);

This call instructs Pin to trap on each instruction in the binary and invoke ins_instrum(). However, this happens only once per instruction. The role of the instrumentation callback that we register is to decide if a certain instruction is of interest to us. “Of interest” can mean basically anything. We can pick and choose “interesting” instructions based on their class, registers / memory operands, functions or objects containing them, etc.

Let's say that an instruction has indeed passed our selection. Now, we can use another Pin API call to insert an analysis routine before or after said instruction. While the instrumentation routine will never be invoked again for that specific instruction, the analysis routine will execute seamlessly for each pass.

For now, let us observe only the instrumentation callback and leave the analysis routine registration for the following task. Take a look at ins_instrum(). Then, compile the tool and run any program you want with it. Waiting for it to finish is not really necessary. Stop it after a few seconds.

$ make
$ ./third_party/pin-3.13/pin -t obj-intel64/minspect.so -- ls -l 1>/dev/null

Just to make sure everything is clear: the default rule for make will generate an obj-intel64/ directory and compile the tool as a shared object. The way to start a process under our tool's instrumentation is by calling the pin utility. -t specifies the tool to be used. Everything after -- should be the exact command that would normally be used to start the target process.

Note: here, we output information to stderr from our instrumentation callback. This is not good practice. The Pin tool and the target process share pretty much everything: file descriptors, virtual memory, etc. Normally, you will want to output these things to a log file. However, let's say we can get away with it for now, under the pretext of convenience.

#### [10p] Task C - Analysis Callbacks (Read)

Going forward, we got rid of some of the clutter in ins_instrum(). As you may have noticed, the most recent addition to this routine is the for loop iterating over the memory operands of the instruction. We check whether each operand is the source of a read using INS_MemoryOperandIsRead(). If this check succeeds, we insert an analysis routine before the current instruction using INS_InsertPredicatedCall(). Let's take a closer look at how this API call works:

INS_InsertPredicatedCall(
    ins, IPOINT_BEFORE, (AFUNPTR) read_analysis,
    IARG_ADDRINT,       ins_addr,
    IARG_PTR,           strdup(ins_disass.c_str()),
    IARG_MEMORYOP_EA,   op_idx,
    IARG_MEMORYREAD_SIZE,
    IARG_END);

The first three parameters are:

• ins: reference to the INS argument passed to the instrumentation callback by default.
• IPOINT_BEFORE: instructs to insert the analysis routine before the instruction executes (see Instrumentation arguments for more details.)
• read_analysis: the function that is to be inserted as the analysis routine.

Next, we pass the arguments for read_analysis(). Each argument is represented by a type macro and the actual value. When we don't have any more parameters to send, we end by specifying IARG_END. Here are all the arguments:

• IARG_ADDRINT, ins_addr: a 64-bit integer containing the absolute address of the instruction.
• IARG_PTR, strdup(ins_disass.c_str()): all objects in the callback's local context will be lost after we return; thus, we need to duplicate the disassembled code's string and pass a pointer to the copy.
• IARG_MEMORYOP_EA, op_idx: effective address of a specific memory operand; so this argument is not passed by value, but instead recalculated each time and passed to the analysis routine seamlessly.
• IARG_MEMORYREAD_SIZE: size in bytes of the memory read; check the documentation for some important exceptions.

Take a look at what read_analysis() does. Recompile the tool and run it again (just as in task B). Finally, apply Task-C.patch and move on to the next task.


#### [10p] Task D - Analysis Callbacks (Write)

For the memory write analysis, we need to insert analysis routines both before and after each instruction. The former saves the original buffer state while the latter displays the information in its entirety. Since an instruction may write to more than one memory location, we push the initial buffer state hexdumps to a stack. Consequently, we need to add the post-write analysis routines in reverse order to ensure that the succession of elements popped from the stack is correct. Let's take a look at the pre-write instrumentation insertion:

INS_InsertPredicatedCall(
ins, IPOINT_BEFORE, (AFUNPTR) pre_write_analysis,
IARG_CALL_ORDER,    CALL_ORDER_FIRST + op_idx + 1,
IARG_MEMORYOP_EA,   op_idx,
IARG_MEMORYWRITE_SIZE,
IARG_END);

We notice a new set of parameters:

• IARG_CALL_ORDER, CALL_ORDER_FIRST + op_idx + 1: specifies the call order when multiple analysis routines are registered; see the CALL_ORDER enum's documentation for details.
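The reverse-registration argument can be illustrated with a toy model (plain Python, independent of Pin): pre-write routines push hexdumps in operand order, so post-write routines must run in reverse operand order for each pop to retrieve its matching operand.

```python
stack = []
ops = [0, 1, 2]                      # indices of written memory operands

for i in ops:                        # pre-write callbacks run in operand order
    stack.append(f"hexdump-op{i}")

# post-write callbacks run in reverse order, so pops match their operands
popped = [(i, stack.pop()) for i in reversed(ops)]
print(popped)
```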

Recompile the tool. Test to see that the write analysis routines work properly. Apply Task-D.patch and let's move on to applying the finishing touches.


#### [5p] Task E - Finishing Touches

This is only a minor addition. Namely, we want to add a command line option -i that can be used multiple times to specify multiple image names (e.g.: ls, libc.so.6, etc.) The tool must forego instrumentation for any instruction that is not part of these objects. As such, we declare a Pin KNOB:

static KNOB<string> knob_img(KNOB_MODE_APPEND,    "pintool",
"i", "", "names of objects to be instrumented for branch logging");

We should not use argp or other alternatives. Instead, let Pin use its own parser for these things. knob_img will act as an accumulator for any argument passed with the flag -i. Observe its usage in ins_instrum().

Determine the shared object dependencies of your target binary of choice. Then try to recompile and rerun the Pin tool while specifying some of them as arguments.

$ ldd /bin/ls
linux-vdso.so.1 (0x00007ffd0d19b000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f32df185000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f32ded94000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f32deb90000)
libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007f32de6ff000)
/lib64/ld-linux-x86-64.so.2 (0x00007f32df7d6000)

This concludes the tutorial. The resulting Pin tool can now be used as a starting point for developing a Taint analysis engine. Discuss more with your lab assistant if you're interested.